Underwater Depth Estimation for Spherical Images

Hindawi Journal of Robotics, Volume 2021, Article ID 6644986, 12 pages, https://doi.org/10.1155/2021/6644986
Research Article
Jiadi Cui, Lei Jin, Haofei Kuang, Qingwen Xu, and Sören Schwertfeger
Mobile Autonomous Robotic Systems Lab, School of Information Science and Technology, ShanghaiTech University, Shanghai, China
Correspondence should be addressed to Jiadi Cui; cuijd@shanghaitech.edu.cn
Received 15 December 2020; Accepted 29 May 2021; Published 18 June 2021
Academic Editor: L. Fortuna
Copyright © 2021 Jiadi Cui et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a method for monocular underwater depth estimation, which is an open problem in robotics and computer vision. To this end, we leverage publicly available in-air RGB-D image pairs for underwater depth estimation in the spherical domain with an unsupervised approach. For this, the in-air images are style-transferred to the underwater style as the first step. Given those synthetic underwater images and their ground truth depth, we then train a network to estimate the depth. This way, our learning model is designed to obtain the depth up to scale, without the need of corresponding ground truth underwater depth data, which is typically not available. We test our approach on style-transferred in-air images as well as on our own real underwater dataset, for which we computed sparse ground truth depth data via stereopsis. This dataset is provided for download. Experiments with this data against a state-of-the-art in-air network as well as different artificial inputs show that the style transfer as well as the depth estimation exhibit promising performance.

1. Introduction

Underwater depth estimation is an open problem for marine robotics [1, 2]; it is typically needed for 3D reconstruction, navigation, and as an intermediate step for underwater color correction [3, 4]. Due to the properties of underwater environments, underwater perception is quite different from in-air perception. Images captured underwater usually look bluish because the longer wavelengths of visible sunlight are absorbed earlier than the shorter ones. Underwater images may also be more greenish because of algae in the water. Besides, underwater images are more blurred than in-air images captured by the same camera, due to turbidity. These effects increase the difficulty of estimating depth from images. Thus, many researchers have put effort into underwater image processing.

For example, the use of dark channel priors was proposed to restore underwater images in [5, 6], inspired by [7] on removing haze in air. The study in [8] implemented underwater image stitching based on spectral methods, which are more robust to turbidity than feature-based methods. Besides image enhancement, some work focused on depth estimation. The study in [9] exploited the relationship between depth and the blurriness of underwater images to estimate depth. In addition, deep learning was also applied to estimate the depth of underwater images; for example, the study in [4] used a convolutional neural network (CNN) to generate relative depth, which was then one of the inputs for a color correction network. Learning-based methods are very popular these days, and there are many applications of depth estimation, for example also in some microsystems [10, 11].

Apart from normal pin-hole cameras, omnidirectional cameras are becoming popular due to their large field of view (FOV). They have been widely used on ground robots [12–16]. Some research groups have also studied omnidirectional cameras for underwater use, since they provide more information than perspective cameras for object detection, localization, and mapping. The study in [17] designed omnidirectional video equipment and mounted it on dolphins to capture data. The study in [18] improved on-land omnidirectional cameras for underwater use and proposed a method for camera calibration.

In addition, the sometimes long visible distances in water increase the region of undefined depth, especially compared to indoor scenes, which makes depth estimation more difficult. Although there are several papers on active methods for underwater 3D imaging [19], capturing omnidirectional underwater depth images remains a big challenge, which makes ground truth depth unavailable.
This paper proposes to leverage publicly available in-air spherical images for depth estimation in the underwater domain. Specifically, our approach follows a two-stage pipeline. (i) Given in-air RGB-D spherical pairs from the Stanford 2D-3D-S dataset [20], we train a style-transfer network [21] to convert in-air images to the underwater domain. (ii) Given the generated underwater images and their depth maps, we train a depth estimation network which is specially designed for spherical images. During testing, we can generate depth directly from the input image. Our approach is unsupervised in that only underwater images (i.e., no ground truth underwater depth) are required for the whole training process.

Following our preliminary work [22], the main contributions of our paper are as follows:

(i) To the best of our knowledge, we are the first group to apply CycleGAN to spherical underwater images.
(ii) This is also the first method to employ deep learning to estimate depth in spherical underwater images.
(iii) We provide a spherical underwater dataset, which consists of 3,000 high-quality images from the Great Barrier Reef.
(iv) We provide a benchmark of the proposed network with respect to handcrafted images.

2. Related Work

2.1. Unsupervised Depth Learning. Learning-based methods for depth estimation are popular. However, for adversarial environments, such as underwater or forest scenarios, annotated data is difficult to obtain. Therefore, supervised learning has difficulties in achieving good performance in the absence of a large amount of labeled data. Unsupervised learning and self-supervised learning are two ways of utilizing unlabeled data in the learning process. One reason for using unlabeled data is that producing a dataset with clear labels is expensive, while unlabeled data is being generated all the time; the motivation is to make use of the much larger amount of unlabeled data. The main idea of self-supervised learning is to generate labels from unlabeled data, according to the structure or characteristics of the data itself, and to train with this data in a supervised manner. Self-supervised learning is widely used in representation learning to make a model learn the latent features of the data. These methods are widely used in computer vision [23–27], video processing [28, 29], and robot control [30–32].
There is much previous work on self-supervised depth estimation. In 2017, [33] proposed the monodepth framework to exploit epipolar geometry constraints and introduced a novel training loss to train the model in a self-supervised way. After that, several methods used geometry constraints to achieve self-supervision. The study in [34] utilized epipolar geometry constraints to estimate both depth and surface normals. The study in [35] investigated the multimodality depth completion task with a self-supervised method by constructing a loss function with photometric constraints, and their method achieved the state of the art (SOTA) on the KITTI depth completion benchmark. The study in [36] exploited the bilateral cyclic relationship between stereo disparities and proposed an adaptive regularization scheme to handle covisibility and occlusion problems in a stereo pair.

Different from methods based on geometric constraints, some approaches try to exploit the constraints between different modalities, the so-called warping-based methods. The study in [37] proposed a warping-based method to estimate both depth and pose; they designed a loss based on warping nearby views to the target view using the computed depth and pose. The study in [38] proposed monodepth2 to combine depth and camera pose with geometry constraints. To improve the robustness of the model, they also proposed the minimum reprojection loss and utilized a multiscale sampling method in their framework. Currently, monodepth2 achieves SOTA results on the KITTI benchmark. Because these methods can predict both depth and camera pose, they are widely used in robotics and self-driving cars as visual odometry (VO) systems. Zhan et al. [39] investigated end-to-end unsupervised depth-VO and also integrated the depth with the Perspective-n-Point (PnP) method to achieve high robustness [40].

This idea was also extended to combine more computer vision tasks. The study in [41] exploited the content consistency between depth and semantic information. The study in [42] proposed GeoNet to utilize the geometric relationships between depth, optical flow, and camera pose and used an unsupervised learning framework to predict them. The study in [43] proposed a competitive collaboration framework to predict depth, pose, optical flow, and motion segmentation in parallel with an unsupervised method.

Currently, unsupervised depth estimation is successful in indoor or urban scenarios, but there are still few applications in adversarial scenarios. The study in [44] proposed a generative model and exploited cycle-consistency constraints to train the model in an unsupervised fashion. Their method achieves the SOTA on their dataset, but it is hard to apply in real underwater applications, and the amount of available data is also not enough for training.

2.2. Underwater Depth Estimation and Color Correction. In contrast to on-land scenarios, underwater depth estimation is more challenging due to scattering and absorption effects [9, 45], as mentioned above. For that reason, several methods jointly optimize depth estimation and color correction: accurate depth helps restore image colors, and depth can also be estimated from the information in the color distortion. For example, the authors of [9, 46] presented an image formation model to estimate depth from image blurriness. In [5], a dark channel prior is used for underwater depth estimation and image restoration to remove the attenuation and backscattering effects. The study in [47] presented adaptive image dehazing based on the depth information.

As introduced in Section 2.1 (Unsupervised Depth Learning), there are many successful learning methods to estimate depth for in-air images. Thus, a naive way to estimate underwater depth is to restore underwater images to in-air style so that these depth learning strategies can be applied. In [48], such a strategy proved to be efficient for underwater depth estimation. Both deep learning and mathematical methods are popular for image restoration. In [49], the Jaffe-McGlamery model [50, 51], a mathematical method, is used to reduce the absorption and scattering effects based on irradiance and depth. In [52], a learning-based method was proposed to solve depth estimation and color correction in the spherical domain at the same time by enforcing left-right consistency under a multicamera setting. However, deep learning usually requires a large amount of data, which is not available for the underwater field. To overcome this problem, the study in [4] proposed a generative adversarial network to generate synthetic underwater images from in-air datasets.

Our work is inspired by WaterGAN [4] but also differs from it: WaterGAN requires depth as input to simulate the attenuation and scattering effects, while our underwater GAN only needs underwater and in-air images as input. Our preliminary work is reported in [22], where we proposed the two-stage pipeline to solve underwater omnidirectional depth estimation. In the first, perspective-image pipeline, WaterGAN [4] was used to transfer RGB-D images to underwater RGB-D images. Then, a fully convolutional residual network (FCRN) [53] depth estimation network was trained with the underwater images as input. In the second, omnidirectional stage, we synthesized underwater equirectangular images from in-air equirectangular images by decreasing the values in the red channel (whose long wavelength is absorbed quickly in the underwater environment) and blurring the image based on its distance to the camera origin. Finally, inspired by [54], a distortion-aware convolution module replaced the normal convolutions in the FCRN based on the spherical longitude-latitude mapping. In this work, we replace the simple operations on the red channel with a learning method to generate synthetic underwater omnidirectional images. In addition, we improve the method to estimate underwater depth. Finally, we evaluate the results of our algorithm more thoroughly by estimating ground truth depths for distinctive feature points. In [54], the FCRN [53] was identified as the state-of-the-art (SOTA) network for omnidirectional CNNs, and we thus adopt it and compare to it in this paper.

We want to emphasize that, in general, depth estimation from a single RGB image is a very challenging problem. As our experiments later show, our approach does not give very accurate estimates, and neither do the other depth estimation approaches mentioned in this section. Also, as with any monocular vision problem, our results are only determined up to an unknown scale factor. Nevertheless, we believe this work to be worthwhile because it paves a path towards potentially more successful approaches (see the future work) and, even without being very accurate, has potential use cases, for example in navigation or color correction.
3. Methodology

Figure 1 illustrates our two-stage pipeline. (i) Given in-air RGB-D spherical pairs from the Stanford 2D-3D-S dataset [20], we train CycleGAN [21] to convert in-air images to the underwater domain. (ii) Given the generated underwater images and their depth maps, we train a depth estimation network to learn depth. In the following, we introduce the two parts separately.

Figure 1: Full pipeline of our approach. We propose to leverage publicly available RGB-D datasets for style transfer and depth estimation in an unsupervised approach.

3.1. Style Transfer. Generative adversarial nets (GANs) were designed for data augmentation and are now widely used in style-transfer tasks. A GAN is a two-player minimax game between a generative model G and a discriminative model D [55]. The value function of this adversarial process is

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \quad (1)

where p_{\mathrm{data}} denotes the distribution of the data and p_z is the input noise distribution, which holds random values at first. This value function is also the loss function for the deep neural network.

The underwater style-transfer algorithm CycleGAN [21] consists of two networks, a network G for the forward mapping and a network F for the inverse mapping. Given input images, network G converts them to the target domain and network F converts them back to the original domain. A cycle consistency is enforced as F(G(X)) ≈ X and vice versa, to ensure that the mappings are well constrained. Thus, the loss function of the forward mapping G: X → Y is

L_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_Y(G(x)))]. \quad (2)
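To make the adversarial term concrete, the following is a minimal PyTorch sketch of equation (2). It is not the authors' implementation; G and D_Y are assumed to be arbitrary nn.Module instances, with the discriminator outputting probabilities in (0, 1).

```python
import torch

def adversarial_loss(D_Y, G, x_air, y_water, eps=1e-8):
    # Equation (2): D_Y should score real underwater images y close to 1
    # and translated in-air images G(x) close to 0.
    d_real = D_Y(y_water)
    d_fake = D_Y(G(x_air))
    return (torch.log(d_real + eps).mean()
            + torch.log(1.0 - d_fake + eps).mean())
```

The CycleGAN reference implementation replaces this log loss with a least-squares variant for training stability, but the structure of the term is the same.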
We use X as the input domain for discriminator D_X and Y as the input domain for discriminator D_Y. Examples of our input images from the two domains are shown in Figures 2 and 3. Since both our input and output live in the spherical domain, we adopt the network directly, with no modification of the convolution operators.

Figure 2: A typical underwater omnidirectional image.

Figure 3: Generated images with our CycleGAN. (a) On the left are examples from Domain I (in-air). (b) On the right are our generated images. We are able to reproduce the lighting and color effects of the original underwater dataset.

Moreover, CycleGAN introduces the idea of cycle consistency, that is, y → F(y) → G(F(y)) ≈ y. The corresponding loss function is

L_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\| G(F(y)) - y \|_1]. \quad (3)

Finally, the full objective for CycleGAN is

L(G, F, D_X, D_Y) = L_{\mathrm{GAN}}(G, D_Y, X, Y) + L_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda L_{\mathrm{cyc}}(G, F), \qquad G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} L(G, F, D_X, D_Y). \quad (4)

Because the method works pixel-to-pixel, the dataset is preprocessed by resizing the images to a reasonable size. Compared with WaterGAN, CycleGAN only needs underwater and in-air images as input, whereas WaterGAN requires depth as input to simulate the attenuation and scattering effects.
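The cycle-consistency term and the full objective of equations (3) and (4) can be sketched in the same way, reusing the adversarial_loss function above. This is a hedged illustration rather than the training code used for the paper; the weight lam = 10.0 is the common CycleGAN default and not a value reported here.

```python
import torch

def cycle_loss(G, F, x_air, y_water):
    # Equation (3): F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y.
    return (torch.mean(torch.abs(F(G(x_air)) - x_air))
            + torch.mean(torch.abs(G(F(y_water)) - y_water)))

def cyclegan_objective(G, F, D_X, D_Y, x_air, y_water, lam=10.0):
    # Equation (4): two adversarial terms (forward G: X -> Y, inverse F: Y -> X)
    # plus the weighted cycle-consistency term.
    return (adversarial_loss(D_Y, G, x_air, y_water)
            + adversarial_loss(D_X, F, y_water, x_air)
            + lam * cycle_loss(G, F, x_air, y_water))
```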
3.2. Depth Estimation. With the recent success of convolutional neural networks, different CNN-based approaches have been proposed to solve the supervised depth estimation task [53, 56]. However, most of these approaches require large amounts of accurate image and ground truth depth pairs, which are currently unavailable in the spherical underwater domain. Instead, we propose to leverage an available in-air spherical dataset, the Stanford 2D-3D-S benchmark [20], and convert it to underwater style with CycleGAN. Specifically, given (X_i, D_i) pairs from the raw Stanford 2D-3D-S benchmark, we first convert X_i to the underwater domain:

\tilde{X}_i = \mathrm{CycleGAN}(X_i), \quad (5)

where X_i denotes the original in-air image from the dataset, D_i its corresponding depth, and \tilde{X}_i the converted underwater image. We can then train our network with the converted (\tilde{X}_i, D_i) pairs.

Following the recent success of depth estimation in the spherical domain [57], we adopt FCRN, one of the state-of-the-art single models on NYUv2 [53]. The network consists of a feature extraction model followed by several upconvolution layers that increase the resolution. Here, a UNet [58] is used as the backbone in all our experiments. Finally, the L1 difference between the output depth and the ground truth depth map is computed:

L_{\mathrm{depth}} = \sum_{(x, y)} \left| D_{\mathrm{pred}}(x, y) - D_{\mathrm{gt}}(x, y) \right|, \quad (6)

where D_{\mathrm{pred}} denotes the prediction of the network, D_{\mathrm{gt}} the ground truth depth map, and (x, y) enumerates all pixels of the input image.

Smoothness regularization has frequently been used for depth estimation in planar images in previous research [33, 38] to encourage the estimated depths to be locally similar. For depth estimation in perspective images, the term is defined as

L_{\mathrm{sm}} = \sum_{p_t} \sum_{d \in \{x, y\}} \left| \nabla_d D_t(p_t) \right|, \quad (7)

where L_{\mathrm{sm}} is a smoothness term that penalizes the L1 norm of first-order depth gradients along both the x and y directions in 2D space.

The equirectangular projection of a 360° image, however, is distorted, and directly applying this smoothness term effectively places larger weight on point pairs at larger latitudes. Simply adding the above loss, designed for perspective images, to the training process might therefore lead to suboptimal results, because the equirectangular projection of spherical images oversamples the image in the polar regions. Taking inspiration from recent work on learning in the spherical domain [59], we propose to base the weight of the distance between two points on their spherical distance, after which we arrive at the following spherical depth smoothness regularizer:

L_{\mathrm{sm}}^{\mathrm{sph}} = \sum_{\theta=0, \phi=0}^{\Theta, \Phi} \omega_{\theta,\phi} \left| \nabla_d D_t(p_t) \right|, \quad (8)

where \omega_{\theta,\phi} is the weight of each point, with \omega_{\theta,\phi} \propto \Omega(\theta, \phi), and \Omega(\theta, \phi) is the solid angle of the sampled area on the depth map at (\theta, \phi). L_{\mathrm{sm}}^{\mathrm{sph}} is a spatial smoothness term that penalizes the L1 norm of second-order depth gradients along both the \theta and \phi directions.

Our final loss is a weighted combination of the above factors with \lambda_1 as the weighting factor:

L = L_{\mathrm{depth}} + \lambda_1 L_{\mathrm{sm}}^{\mathrm{sph}}. \quad (9)
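A minimal PyTorch sketch of the loss in equations (6), (8), and (9) is given below. It is an illustrative reconstruction rather than the training code: the solid-angle weight of an equirectangular row is taken to be proportional to the cosine of its latitude, and first-order gradients are used for brevity, whereas equation (8) as written refers to second-order gradients.

```python
import math
import torch

def spherical_smoothness(depth):
    # depth: (B, 1, H, W) equirectangular depth prediction.
    # Weight each row by the solid angle of its pixels (proportional to
    # cos(latitude)) so that the oversampled polar regions do not dominate.
    b, _, h, w = depth.shape
    lat = torch.linspace(math.pi / 2, -math.pi / 2, h, device=depth.device)
    weight = torch.cos(lat).clamp(min=0.0).view(1, 1, h, 1)
    grad_phi = torch.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])    # latitude direction
    grad_theta = torch.abs(depth[:, :, :, 1:] - depth[:, :, :, :-1])  # longitude direction
    return ((weight[:, :, 1:, :] * grad_phi).mean()
            + (weight * grad_theta).mean())

def total_loss(pred, gt, lam=1e-4):
    # Equations (6) and (9): L1 depth error plus the weighted spherical
    # smoothness term; lam follows the 1e-4 reported in Section 4.2.
    return torch.abs(pred - gt).mean() + lam * spherical_smoothness(pred)
```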
4. Experimental Details

We evaluate our approach with two experiments. First, we use the synthetic underwater Stanford 2D-3D-S dataset, with exact ground truth, to quantitatively evaluate the algorithm. Here, we also compare to the SOTA algorithm for in-air spherical images, FCRN [53], in two setups: FCRN trained with the synthetic (GAN) images and FCRN trained with the original RGB images. All algorithms are tested using the synthetic underwater images. The second experiment uses real omnidirectional underwater images and sparse ground truth points estimated via bundle adjustment to test the algorithm with in situ data.

In the following, we first introduce the datasets, hyperparameters, and evaluation metrics used in the experiments.

4.1. Datasets. Stanford 2D-3D-S [20] is one of the standard in-air benchmarks. The dataset provides omnidirectional RGB images and corresponding depth information, which is the data necessary for depth estimation training. Furthermore, it also provides 2D and 3D semantics, 3D meshes, and surface normals.

In addition, we use a dataset that we collected by scuba diving in the Great Barrier Reef. We use it both for training our CycleGAN with original spherical underwater images and for testing our approach. This omnidirectional dataset for style transfer and testing was collected with an Insta360 ONE X camera (https://www.insta360.com/product/insta360-onex) at depths between 1 m and 25 m.

To evaluate the final results of our two-stage pipeline, ground truth depth for the underwater scenario is generated based on epipolar geometry. The generation steps are as follows: first, a pair of stereo images with a known baseline is used to estimate sparse map points by feature matching, the five-point algorithm [60], and triangulation [61]. Then, two pairs of stereo images taken at different times, with a large enough spatial disparity and including the pair used for the map points, are used to fine-tune the positions of the map points with bundle adjustment. Finally, the depth of these map points is normalized to the range 0 to 255 and used as up-to-scale ground truth.

Figure 4: An example of ground truth points. The picture was captured by an Insta360 ONE X camera in a real ocean scenario. Green points represent the interest points, whose depths are calculated by stereopsis.

Figure 4 shows an example of the points (green dots) that are used as ground truth. It can be seen that most of these points lie on the reef rather than in the water, because open water and the surface do not provide feature points. Though only sparse points are generated, we believe that they are sufficient for the evaluation of our depth results. On the underwater dataset used for evaluation, we generate about 100 points per image.
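As an illustration of the first steps of this procedure (the bundle-adjustment refinement is omitted), the following OpenCV sketch recovers sparse, up-to-scale depths for a rectilinear image pair, or a perspective crop of the spherical frames, with assumed intrinsics K. It is a sketch under these assumptions, not the actual pipeline used to build the dataset.

```python
import cv2
import numpy as np

def sparse_depth_from_stereo(img_left, img_right, K):
    # Match features, estimate relative pose with the five-point algorithm
    # inside RANSAC, and triangulate sparse map points (depth up to scale).
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_left, None)
    kp2, des2 = orb.detectAndCompute(img_right, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    inl1, inl2 = pts1[mask.ravel() > 0], pts2[mask.ravel() > 0]
    pts4d = cv2.triangulatePoints(P1, P2, inl1.T, inl2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T
    return inl1, pts3d[:, 2]   # pixel locations and their up-to-scale depths
```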
4.2. Hyperparameters. The hyperparameters for the style transfer include the resolution of the input images, which is set to 512 × 256 pixels. We then train the CycleGAN [21] with a learning rate of 2e-4 for 8 epochs.

We implement the FCRN for depth estimation with the PyTorch framework and train our network with the following hyperparameter settings during pretraining: mini-batch size 8, learning rate 1e-2, momentum 0.9, weight decay 0.0005, and 50 epochs. We gradually reduce the learning rate by a factor of 0.1 every 10 epochs. Finally, we fine-tune the whole network with a learning rate of 1e-4 for another 20 epochs. λ_1 is set to 1e-4 in all our experiments.

4.3. Metrics. For our depth estimation network, we adopt FCRN [53] and compare the model trained with the initial loss function and with our new loss function. Apart from these two networks, we also use FCRN trained on the original in-air images, which are not processed by CycleGAN. For evaluation, we use the following common metrics for the comparisons on the datasets mentioned above: the root mean square error (RMS) \sqrt{\frac{1}{T}\sum_p (g_p - z_p)^2}, the mean relative error (Rel) \frac{1}{T}\sum_p |g_p - z_p| / g_p, the mean log10 error \frac{1}{T}\sum_p |\log_{10} g_p - \log_{10} z_p|, and the pixel accuracy, defined as the percentage of pixels with \max(z_p / g_p, g_p / z_p) < \delta for \delta \in \{1.25, 1.25^2, 1.25^3\}. T denotes the number of pixels, and g_p and z_p represent the ground truth and the predicted depth at pixel p, respectively.
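As a reference, these metrics can be computed with a few lines of NumPy. This is only an illustrative sketch under the definitions above, not the evaluation script used for the paper; it assumes dense, strictly positive ground truth and predictions.

```python
import numpy as np

def depth_metrics(gt, pred):
    # RMS, mean relative error, mean log10 error, and the three delta accuracies.
    mask = gt > 0
    g, z = gt[mask], pred[mask]
    rms = np.sqrt(np.mean((g - z) ** 2))
    rel = np.mean(np.abs(g - z) / g)
    log10 = np.mean(np.abs(np.log10(g) - np.log10(z)))
    ratio = np.maximum(g / z, z / g)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return rms, rel, log10, deltas
```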
4.4. Metric for the Real Experiment. To evaluate the final results of our two-stage approach, we rely on the sparse ground truth points obtained with the approach described in Section 4.1 (Datasets). For every nonzero point, whose position is denoted by (i, j), we look up the corresponding depth in the ground truth and in the estimated depth map. The result of our estimation is only determined up to an unknown scale factor, so we minimize the error by calculating the best-fitting scale factor with respect to the ground truth. To do so, we compute the ratio of the ground truth value P_{gt}(i, j) to the result value P(i, j) for each point pair and take the median s of these ratios over one image, similar in spirit to a least-squares fit, as the scale parameter between ground truth and result. Finally, we rescale the result and compute the error E for each point. The error E for one image is then

E = Q_{1/2}\!\left( \frac{\left| P_{\mathrm{gt}}(i, j) - s \cdot P(i, j) \right|}{P_{\mathrm{gt}}(i, j)} \right), \quad \text{if } P_{\mathrm{gt}}(i, j) \neq 0, \quad (10)

where the operator Q_{1/2} takes the median over all pairs of ground truth points and result points.
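The scale alignment and per-point error of equation (10) can be sketched as follows. Again, this is an illustrative reconstruction rather than the authors' code; the small clamp on the prediction is an assumption to avoid division by zero (the paper instead fixes s = 1 for the all-zero case discussed in Section 5.2).

```python
import numpy as np

def scale_aligned_error(pred, gt):
    # Equation (10): s is the median ratio between ground truth and prediction;
    # errors are relative to the ground truth at every valid sparse point.
    valid = gt > 0
    p = np.maximum(pred[valid], 1e-8)
    s = np.median(gt[valid] / p)
    err = np.abs(gt[valid] - s * pred[valid]) / gt[valid]
    return np.median(err), err.mean(), err.std()
```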
5. Results

In this section, we demonstrate the results on the converted Stanford 2D-3D-S dataset and on real underwater images collected in the Great Barrier Reef.

5.1. Evaluation of Synthetic Images. Since there are few underwater datasets with ground truth depth, we synthesize underwater-style images from the Stanford 2D-3D-S dataset. CycleGAN [21] is used to generate the synthetic underwater images in this work. Figure 3 shows several examples of the synthetic images. It can be seen that the generated images successfully transfer the in-air images to underwater style, especially with respect to color.

One interesting phenomenon during the transfer is that, if we train the style-transfer network for many epochs, a lot of unnecessary and unreasonable features are also learned; in most cases, we only need to transfer some specific features, such as color. Testing on our own underwater dataset also revealed that the estimation results for some water-only parts are not accurate enough. This may also be due to the fact that indoor scenarios are too different from the underwater domain.

Figure 5 presents the estimated depth for the synthetic underwater Stanford 2D-3D-S dataset, where brighter pixels represent larger depth and darker pixels are closer. It can be seen that the estimated depths on the right of Figure 5, corresponding to the input images on the left, are acceptable, especially in the farther areas. Additionally, Table 1 gives a more rigorous evaluation of the results. Compared to the classic FCRN network, our improved loss function gives slightly better results, as indicated by the smaller RMS, Rel, and log10 errors. It can also be seen from the FCRN RGB experiment that using RGB images for training the SOTA network gives far worse results compared to ours and to FCRN trained with GAN images. Because the style-transferred images mainly imitate the color information, the network can be adapted to estimate the depth information from these images.

Figure 5: Generated depth from the style-transferred underwater Stanford 2D-3D-S dataset. (a) On the left are the input images. (b) On the right are the corresponding predicted depth maps.

Table 1: Performance comparison on 1412 images from the Stanford 2D-3D-S dataset.

Methods              RMS (m) ↓   Rel (m) ↓   log10 ↓   δ < 1.25 ↑   δ < 1.25² ↑   δ < 1.25³ ↑
Ours: + L_grad^sph   0.683       0.177       0.075     0.744        0.919         0.972
FCRN GAN             0.687       0.181       0.078     0.737        0.920         0.972
FCRN RGB             1.281       0.327       0.181     0.387        0.648         0.801

All tests use images transformed with the GAN as input. Our approach and FCRN GAN were trained with synthetic images, while FCRN RGB uses, for comparison, RGB images as training data. The metrics are explained in Section 4.3. The arrows indicate whether smaller (↓) or larger (↑) values are better.

5.2. Evaluation of Real Underwater Images. After achieving acceptable results on the synthetic dataset, we also evaluate the results on real underwater images. Note that we cannot compare to any other methods here since, to the best of our knowledge, we are the first to propose an algorithm for depth estimation on spherical underwater images. Figure 6 shows the estimated depth on our underwater dataset. Similarly, it can be seen that the brighter parts in the depth maps correspond to areas farther away in the input images, which implies that the network at least estimates the depth correctly in some regions.

Because our network is trained on the Stanford 2D-3D-S dataset, in which the original images all lack the upper and lower parts (15.6% of the image height for each part), these parts are filled with pure black pixels. Therefore, the upper and lower parts of the final underwater depth estimation results are also not evaluated. In other words, we effectively use panorama images instead of full spherical images.

Figure 6: Generated depth from our underwater dataset. (a) On the left are the input images. (b) On the right are the corresponding predicted depth maps. The upper and lower parts (15.6% of the image height for each part) are not good; the reasons are discussed in this section.

Though our underwater dataset does not have ground truth depth maps, we can evaluate the results with the sparse map points. We randomly choose 20 images to test against the corresponding ground truth calculated by stereopsis. According to the metric presented above, the results are shown in the first row of Table 2. Each column shows results averaged over all images: in the first column, we take the median of the errors of all pixels for which we have ground truth in an image; in the second column, we take the mean error of those pixels; and the last column shows the standard deviation in each image, each averaged over all images. We can see that the average median error is 22% of the estimated depth, with a mean error of 40% and a standard deviation of 62%. Of course, those values show that the estimated depth is quite inaccurate. Nevertheless, we believe that it is still somewhat useful for certain applications, for example, navigation, colorization, dehazing, or location fingerprinting. Furthermore, we hope that, in the future, those values can be improved, for example by better and more training data and by providing a few consecutive or stereo frames as input.

Table 2: Performance comparison between the ground truth and various results.

Result type                Average median error   Average mean error   Average standard deviation
Ours                       0.22                   0.40                 0.62
FCRN (trained with RGB)    0.30                   3.76                 7.16
Black result               1.00                   1.00                 0.00
White result               0.95                   1.10                 0.65
Random noise result        0.96                   2.83                 3.31
Gray-scale result          0.95                   1.10                 7.12
Black input                0.27                   3.75                 7.18
White input                0.31                   3.70                 6.91
Random noise input         0.32                   3.77                 7.00
Gray-scale input           0.24                   0.51                 1.26

More details are shown in the Supplementary Material.

In order to better understand the properties of our approach and put the evaluation results into perspective, we use the same test frames to compare with three other cases. The second row of Table 2 shows the results of the original FCRN, trained with the normal RGB images from Stanford 2D-3D-S. When testing this network on our real underwater data, we see that the average mean error and the average standard deviation are very large compared to our proposed approach. This shows that using the CycleGAN synthetic images during training is very advantageous. Even though this does not prove that CycleGAN provides a very realistic underwater transfer, it is a strong indication towards it.

The other cases shown in Table 2 are meant to demonstrate that our approach is indeed doing something useful and not just producing random values. First, we create four different fake depth results for comparison: the "black result" depth image is all black (0 distance), the "white result" depth image is all white, the "random noise result" depth image has random distances, and the "gray-scale result" is simply the input underwater image converted to gray scale. Please note that, in the "black result" case, all values in the image are 0, so the scale parameter s cannot be obtained with the metric presented above; however, any scale applied to 0 yields 0 again. Thus, we adapt the metric for this case by setting the scale parameter s = 1. The error in that case is then always 1, and the standard deviation is 0. We can see that the evaluations of all those fake results are much worse than our result.

Second, we used the same data as above (black, white, random noise, and gray-scale images) as the input to our approach. This can be regarded as a test of whether the network is overfitting too much: generating good results on meaningless data would be a clear indication of overfitting, for example, because the training data is not diverse enough. We can see that the average median error is in the range of our result. We think this is due to two reasons: (i) provided with meaningless data, the network seems to generate depth images that somewhat resemble typical depth images; thus, it might be overfitting a bit; (ii) the rescaling process of our evaluation optimizes the generated depth maps such that they best fit the ground truth (of the underwater image that is not being used here), and the median error with respect to that ground truth may be quite small for such "typical" depth images generated from meaningless data. Looking at the average mean error and standard deviation, however, we see that those generated depth maps have a very large error, showing that our result is clearly much better.

In the last row, we use the gray-scale version of the color frame as the input. As could be expected, this gives reasonable, second-best results. Nevertheless, it is still worse than the color input, so color seems to be important. Comparing the result of our method to all other tests, we see that the average median error, average mean error, and average standard deviation are much better for our approach, clearly showing that our approach does work to a certain extent.
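For reproducibility of this sanity check, the artificial inputs can be generated as in the following sketch; the `model` handle and the use of a plain channel mean for the gray-scale image are assumptions for illustration, not details taken from the paper.

```python
import torch

def make_artificial_inputs(rgb):
    # rgb: (B, 3, H, W) real underwater frame with values in [0, 1].
    gray = rgb.mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)   # gray-scale copy
    return {
        "black": torch.zeros_like(rgb),
        "white": torch.ones_like(rgb),
        "random noise": torch.rand_like(rgb),
        "gray-scale": gray,
    }

# Usage (hypothetical trained `model`):
# for name, fake in make_artificial_inputs(frame).items():
#     with torch.no_grad():
#         depth = model(fake)
#     # compare `depth` against the sparse ground truth as in equation (10)
```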
6. Conclusions

This paper presented a supervised depth learning method for underwater spherical images. First, we implemented style transfer based on CycleGAN to synthesize underwater images. The results show that CycleGAN learned the features of underwater scenarios and synthesizes convincing images in the underwater style. Those images are then used to train a second network, a fully convolutional residual network (FCRN), for underwater spherical depth estimation. This network is trained in a supervised manner. Our first experiment used the synthetic images from CycleGAN for evaluation and comparison with FCRN. Furthermore, we tested our method on real underwater data from the Great Barrier Reef, for which we estimated sparse ground truth depth points using stereopsis and bundle adjustment. We also compared our results to artificial input and output data to show that the network is indeed performing depth estimation. The experiments demonstrated that the style transfer as well as the depth estimation results are convincing. Our method achieves better results than training without the GAN, and it achieves slightly better results than FCRN trained with GAN images, so our updated loss function is beneficial. The experiments also showed that the estimated depth on real underwater images is somewhat reasonable and better than all other methods and options we compared to.

Nevertheless, the approach is far from perfect, especially regarding the accuracy of the estimated depth. This is mainly due to the fact that estimating depth from a single image is a very challenging task. Our approach is also not very general: the underwater dataset was taken at only one location with very good visibility, and there are many more underwater scenarios with differing styles, so more underwater training data is needed. In the future, we plan to work on a unified approach that can work in many different underwater situations. In addition, for testing in real underwater environments, we also plan to mask water-only areas with a segmentation process. Collecting an in-air dataset with depth that looks closer to underwater images, for example canyons or deserts, might also further improve our performance. Since the underwater data we collected also contains spherical videos from two more cameras, we will investigate using this stereo data for depth training. Furthermore, more complicated network structures that take previous frames into account may provide even better results.

Data Availability

The images of the underwater dataset, including the data for the ground truth evaluation, can be found at https://robotics.shanghaitech.edu.cn/static/datasets/underwater/UW_omni.tar.gz (780 MB).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Supplementary Materials

Tables S1, S2, and S3 show the median, mean, and standard deviation of the error between the ground truth and the results estimated with different methods. The column "ours" is the result estimated by the proposed method. The "gray-scale" image is converted from the input RGB image; the remaining ones, "random noise," "white," and "black," are generated manually. The columns labeled "result" are calculated by comparing the ground truth with the image directly, whereas those labeled "input" are computed by first feeding the image to the proposed network and then comparing the output with the ground truth. "Ours without GAN" denotes the result of the model trained on the original in-air dataset, without CycleGAN. In addition, "gt size" is the number of points provided by the ground truth. (Supplementary Materials)
Cosman, “Single underwater for the ground truth evaluation, can be found on https:// image enhancement using depth estimation based on blur- robotics.shanghaitech.edu.cn/static/datasets/underwater/UW_ riness,” in Proceedings of the 2015 IEEE International Con- omni.tar.gz (780 MB). ference on Image Processing (ICIP), pp. 4952–4956, Quebec, Canada, September 2015. [10] P. Anandan, S. Gagliano, and M. Bucolo, “Computational Conflicts of Interest models in microfluidic bubble logic,” Microfluidics and (e authors declare that they have no conflicts of interest. Nanofluidics, vol. 18, no. 2, pp. 305–321, 2015. [11] F. Cairone, P. Anandan, and M. Bucolo, “Nonlinear systems synchronization for modeling two-phase microfluidics flows,” Supplementary Materials Nonlinear Dynamics, vol. 92, no. 1, pp. 75–84, 2018. [12] A. A. Argyros, K. E. Bekris, S. C. Orphanoudakis, and Tables S1, S2, and S3 show the median, mean, and standard L. E. Kavraki, “Robot homing by exploiting panoramic vi- deviation of the error between the ground truth and results sion,” Autonomous Robots, vol. 19, no. 1, pp. 7–25, 2005. estimated from different methods. (e column “ours” is the [13] R. Benosman, S. Kang, and O. Faugeras, Panoramic Vision, result estimated by the proposed method. (e “gray-scale” is Springer-Verlag New York, Berlin, Germany, 2000. converted from the input RGB image. (e remaining [14] H. Kuang, Q. Xu, X. Long, and S. Schwertfeger, “Pose esti- “random noise,” “white” and “black,” is generated manually. mation for omni-directional cameras using sinusoid fitting,” (e column with “result” is calculated by comparing the in Proceedings of the IEEE/RSJ International Conference on ground truth and the image directly whereas that with Intelligent Robots and Systems (IROS), Macau, China, No- “input” is computed by firstly taking the image as the input vember 2019. [15] T. Lemaire and S. Lacroix, “Slam with panoramic vision,” of the proposed network and then comparing the output Journal of Field Robotics, vol. 24, no. 1-2, pp. 91–111, 2007. with ground truth. (e “ours without GAN” denotes the [16] Q. Xu, A. Gomez Chavez, H. Bulow, ¨ A. Birk, and result about the model trained by the original in-air dataset, S. Schwertfeger, “Improved fourier mellin invariant for robust without CycleGAN. In addition, the “gt size” is the number rotation estimation with omni-cameras,” in Proceedings of the of points provided by ground truth. (Supplementary 2019 26th IEEE International Conference on Image Processing. Materials) IEEE, Taipei, Taiwan, September 2019. [17] B. Terry, “Dove: dolphin omni-directional video equipment,” References in Proceedings of the International Conference on Robotics and Automation, pp. 214–220, Paris, France, May 2000. [1] A. Gomez Chavez, Q. Xu, C. A. Mueller, S. Schwertfeger, and [18] J. Bosch, N. Gracias, P. Ridao, and D. Ribas, “Omnidirectional A. Birk, “Adaptive navigation scheme for optimal deep-sea underwater camera design and calibration,” Sensors, vol. 15, localization using multimodal perception cues,” in Proceed- no. 3, pp. 6033–6065, 2015. ings of the IEEE/RSJ International Conference on Intelligent [19] F. Bruno, G. Bianco, M. Muzzupappa, S. Barone, and Robots and Systems (IROS), Macau, China, November 2019. A. V. Razionale, “Experimentation of structured light and [2] J. Yuh and M. West, “Underwater robotics,” Advanced Ro- stereo vision for underwater 3d reconstruction,” ISPRS botics, vol. 15, no. 5, pp. 609–639, 2001. Journal of Photogrammetry and Remote Sensing, vol. 66, no. 4, [3] C. 
Beall, B. J. Lawrence, I. Viorela, and D. Frank, “3d re- pp. 508–518, 2011. construction of underwater structures,” in Proceedings of the [20] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2d-3d- 2010 IEEE/RSJ International Conference on Intelligent Robots semantic data for indoor scene understanding,” 2017, https:// and Systems, pp. 4418–4423, IEEE, Taipei, Taiwan, September arxiv.org/abs/1702.01105. [21] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image- [4] J. Li, K. A. Skinner, E. Ryan, and M. J.-R. Watergan, “Un- to-image translation using cycle-consistent adversarial net- supervised generative network to enable real-time color works,” in Proceedings of the 2017 IEEE International Con- correction of monocular underwater images,” IEEE Robotics ference on Computer Vision (ICCV), Venice, Italy, October and Automation Letters (RA-L), pp. 387–394, 2017. [5] P. L. J. Drews, E. R. Nascimento, S. S. C. Botelho, and [22] H. Kuang, Q. Xu, and S. Schwertfeger, “Depth estimation on M. F. Montenegro Campos, “Underwater depth estimation underwater omni-directional images using a deep neural and image restoration based on single images,” IEEE Com- network,” 2019, https://arxiv.org/abs/1905.09441. puter Graphics and Applications, vol. 36, no. 2, pp. 24–35, [23] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings [6] T. Łuczynski ´ and A. Birk, “Underwater image haze removal of the IEEE International Conference on Computer Vision, with an underwater-ready dark channel prior,” in OCEANS 2017, pp. 1–6, IEEE, Anchorage, AK, USA, September 2017. pp. 1422–1430, Santiago, Chile, December 2015. [24] J. Donahue, P. Krahenb ¨ uhl, ¨ and Trevor Darrell, “Adversarial [7] K. Kaiming He, J. Jian Sun, and X. Xiaoou Tang, “Single image haze removal using dark channel prior,” IEEE Transactions on feature learning,” 2016, https://arxiv.org/abs/1605.09782. Journal of Robotics 11 [25] A Dosovitskiy, P. Fischer, J. Tobias Springenberg, estimation and visual odometry with deep feature recon- M. Riedmiller, and T. Brox, “Discriminative unsupervised struction,” in Proceedings of the IEEE Conference on Computer feature learning with exemplar convolutional neural net- Vision and Pattern Recognition, pp. 340–349, Long Beach, CA, works,” IEEE Transactions on Pattern Analysis and Machine USA, June 2019. Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015. [40] H. Zhan, C. S. Weerasekera, J. Bian, and I. Reid, “Visual [26] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised odometry revisited: what should be learnt?,” 2019, https:// representation learning by predicting image rotations,” in arxiv.org/abs/1909.09803. Proceedings of the International Conference on Learning [41] P.-Y. Chen, H. Alexander, Y.-C. Liu, and Y.-C. F. Wang, Representations, Vancouver, Canada, April 2018. “Towards scene understanding: unsupervised monocular [27] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colori- depth estimation with semantic-aware representation,” in zation,” in Proceedings of the European conference on com- Proceedings of the IEEE Conference on Computer Vision and puter vision, pp. 649–666, Amsterdam, Netherlands, October Pattern Recognition, pp. 2624–2632, Long Beach, CA, USA, June 2019. [28] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and [42] Z. Yin and J. Shi, “Geonet: unsupervised learning of dense K. 
Murphy, “Tracking emerges by colorizing videos,” in depth, optical flow and camera pose,” in Proceedings of the Proceedings of the European Conference on Computer Vision IEEE Conference on Computer Vision and Pattern Recognition, (ECCV), pp. 391–408, Munich, Germany, September 2018. pp. 1983–1992, Long Beach, CA, USA, June 2019. [29] X. Wang and A. Gupta, “Unsupervised learning of visual [43] A. Ranjan, V. Jampani, L. Balles et al., “Competitive collab- representations using videos,” in Proceedings of the IEEE oration: joint unsupervised learning of depth, camera motion, International Conference on Computer Vision, pp. 2794–2802, optical flow and motion segmentation,” in Proceedings of the Santiago, Chile, December 2019. IEEE Conference on Computer Vision and Pattern Recognition, [30] E. Jang, C. Devin, V. Vincent, and S. Levine, “Grasp2vec: pp. 12240–12249, Long Beach, CA, USA, June 2019. learning object representations from self-supervised grasp- [44] H. Gupta and K. Mitra, “Unsupervised single image under- ing,” in Proceedings of the Conference on Robot Learning, water depth estimation,” in Proceedings of the 2019 IEEE Zurich, Switzerland, October 2018. International Conference on Image Processing (ICIP), [31] A. Nair, S. Bahl, K. Alexander, P. Vitchyr, G. Berseth, and pp. 624–628, Taipei, Taiwan, September 2019. S. Levine, “Contextual imagined goals for self-supervised [45] D. Paul, E. Nascimento, F. Moraes, S. Botelho, and robotic learning,” in Proceedings of the Conference on Robot M. Campos, “Transmission estimation in underwater single Learning, Osaka, Japan, October 2019. images,” in Proceedings of the IEEE International Conference [32] X. Zhi, X. He, and S. Schwertfeger, “Learning autonomous on Computer Vision Workshops, pp. 825–830, Sydney, Aus- exploration and mapping with semantic vision,” in Pro- tralia, April 2013. ceedings of the International Conference on Image, Video and [46] Y.-T. Peng and P. C. Cosman, “Underwater image restoration Signal Processing. IVSP, Shanghai China, February 2019. based on image blurriness and light absorption,” IEEE [33] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Transactions on Image Processing, vol. 26, no. 4, pp. 1579– monocular depth estimation with left-right consistency,” in 1594, 2017. Proceedings of the IEEE Conference on Computer Vision and [47] X. Ding, Y. Wang, J. Zhang, and X. Fu, “Underwater image Pattern Recognition, pp. 270–279, Honolulu, HI, USA, July 2017. dehaze using scene depth estimation with adaptive color [34] H. Zhan, C. S. Weerasekera, R. Garg, and I. Reid, “Self-su- correction,” in OCEANS 2017, pp. 1–5, Aberdeen, Scotland, pervised learning for single view depth and surface normal June 2017. estimation,” in Proceedings of the 2019 International Con- [48] C. O Ancuti, C. Ancuti, C. De Vleeschouwer, L. Neumann, ference on Robotics and Automation (ICRA), pp. 4811–4817, and R. Garcia, “Color transfer for underwater dehazing and Montreal, Canada, May 2019. depth estimation,” in Proceedings of the 2017 IEEE Interna- [35] F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised tional Conference on Image Processing (ICIP), pp. 695–699, sparse-to-dense: self-supervised depth completion from lidar Beijing, China, September 2017. and monocular camera,” in Proceedings of the 2019 Inter- [49] K. A. Skinner, E. Iscar, and M. Johnson-Roberson, “Auto- national Conference on Robotics and Automation (ICRA), matic color correction for 3d reconstruction of underwater pp. 3288–3295, Montreal, Canada, May 2019. 
scenes,” in Proceedings of the 2017 IEEE International Con- [36] A. Wong and S. Soatto, “Bilateral cyclic constraint and ference on Robotics and Automation (ICRA), pp. 5140–5147, adaptive regularization for unsupervised monocular depth Singapore, May 2017. prediction,” in Proceedings of the IEEE Conference on Com- [50] J. S. Jaffe, “Computer modeling and the design of optimal puter Vision and Pattern Recognition, pp. 5644–5653, Long underwater imaging systems,” IEEE Journal of Oceanic En- Beach, CA, USA, June 2019. gineering, vol. 15, no. 2, pp. 101–111, 1990. [37] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsu- [51] B. L. McGlamery, “Computer analysis and simulation of pervised learning of depth and ego-motion from video,” in underwater camera system performance,” SIO Reference, Proceedings of the IEEE Conference on Computer Vision and vol. 75, no. 2, 1975. Pattern Recognition, pp. 1851–1858, Honolulu, HI, USA, July [52] K. A. Skinner, J. Zhang, E. A. Olson, and M. J.-R. Uwstereonet, “Unsupervised learning for depth estimation and color cor- [38] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, rection of underwater stereo imagery,” in Proceedings of the “Digging into self-supervised monocular depth estimation,” 2019 International Conference on Robotics and Automation in Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838, Seoul, Korea, November (ICRA), pp. 7947–7954, Singapore, May 2019. [53] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and [39] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and N. Navab, “Deeper depth prediction with fully convolutional I. Reid, “Unsupervised learning of monocular depth residual networks,” in Proceedings of the 2016 Fourth 12 Journal of Robotics International Conference on 3D Vision (3DV), pp. 239–248, Stanford, California, October 2016. [54] K. Tateno, N. Navab, and F. Tombari, “Distortion-aware convolutional filters for dense prediction in panoramic im- ages,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 707–722, Munich, Germany, September [55] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [56] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Ad- vances in Neural Information Processing Systems, pp. 2366– 2374, 2014. [57] L. Jin, Y. Xu, Z. Jia et al., “Geometric structure based and regularized depth estimation from 360 indoor imagery,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 889–898, Seattle, WA, USA, June [58] O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolu- tional networks for biomedical image segmentation,” in Proceedings of the International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Munich, Germany, October 2015. [59] Z. Zhang, Y. Xu, J. Yu, and S. Gao, “Saliency detection in 360 videos,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 488–503, Munich, Germany, September 2018. [60] H. Stewenius, D. Nister, F. Kahl, and F. Schaffalitzky, “A minimal solution for relative pose with unknown focal length,” in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 789–794, San Diego, California, June [61] R. Hartley and A. 
Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, UK, 2003. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Robotics Hindawi Publishing Corporation

Underwater Depth Estimation for Spherical Images

Loading next page...
 
/lp/hindawi-publishing-corporation/underwater-depth-estimation-for-spherical-images-B89a3TOXC9

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2021 Jiadi Cui et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-9600
eISSN
1687-9619
DOI
10.1155/2021/6644986
Publisher site
See Article on Publisher Site

Abstract

Hindawi Journal of Robotics Volume 2021, Article ID 6644986, 12 pages https://doi.org/10.1155/2021/6644986 Research Article Jiadi Cui , Lei Jin, Haofei Kuang, Qingwen Xu, and So ¨ren Schwertfeger Mobile Autonomous Robotic Systems Lab, School of Information Science and Technology, ShanghaiTech University, Shanghai, China Correspondence should be addressed to Jiadi Cui; cuijd@shanghaitech.edu.cn Received 15 December 2020; Accepted 29 May 2021; Published 18 June 2021 Academic Editor: L. Fortuna Copyright © 2021 Jiadi Cui et al. (is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. (is paper proposes a method for monocular underwater depth estimation, which is an open problem in robotics and computer vision. To this end, we leverage publicly available in-air RGB-D image pairs for underwater depth estimation in the spherical domain with an unsupervised approach. For this, the in-air images are style-transferred to the underwater style as the first step. Given those synthetic underwater images and their ground truth depth, we then train a network to estimate the depth. (is way, our learning model is designed to obtain the depth up to scale, without the need of corresponding ground truth underwater depth data, which is typically not available. We test our approach on style-transferred in-air images as well as on our own real un- derwater dataset, for which we computed sparse ground truth depths data via stereopsis. (is dataset is provided for download. Experiments with this data against a state-of-the-art in-air network as well as different artificial inputs show that the style transfer as well as the depth estimation exhibit promising performance. images to estimate depth. In addition, deep learning was also 1. Introduction applied to estimate the depth of underwater images, for Underwater depth estimation is an open problem for marine example, the study in [4] used a convolution neural network robotics [1, 2], which is usually used for 3D reconstruction, (CNN) to generate relative depth, which was then one of the navigation, and intermediate steps for underwater color inputs for a color correction network. Learning-based correlation [3, 4]. Due to the properties of underwater methods are very popular these days, and there are many environments, underwater perception is quite different from applications about depth estimation, for example also in in-air perception. Images captured underwater usually look some microsystems [10, 11]. bluish because longer wavelengths of the visible sunlight are Apart from normal pin-hole cameras, omnidirectional absorbed earlier than shorter wavelengths. Underwater cameras are becoming popular, due to their large field of images may also be more greenish, because of algae in the view (FOV). (ey have been widely used on ground robots water. Besides, the underwater images are more blurred than [12–16]. Some research groups also studied omnidirectional those in-air captured by the same camera, due to turbidity. cameras for underwater use since they provide more in- (ese reasons increase the difficulty of depth estimation formation than perspective ones on object detection, lo- from images. (us, many researchers put effort on under- calization, and mapping. (e study in [17] designed water image processing. 
For example, the use of dark channel priors to restore underwater images is proposed in [5, 6], inspired by [7] on removing haze in air. The study in [8] implemented underwater image stitching based on spectral methods, which are more robust to turbidity than feature-based methods. Besides image enhancement, some work focused on depth estimation: the study in [9] exploited the relationship between depth and the blurriness of underwater images to estimate depth.

The study in [17] designed omnidirectional video equipment and mounted it on dolphins to capture data. The study in [18] adapted on-land omnidirectional cameras for underwater use and proposed a method for camera calibration. In addition, the sometimes long visible distances in water increase the region of undefined depth, especially compared to indoor scenes, which makes depth estimation more difficult. Although there are several papers on active methods for underwater 3D imaging [19], capturing omnidirectional underwater depth images remains a big challenge, which makes ground truth depth unavailable.

This paper proposes to leverage publicly available in-air spherical images for depth estimation in the underwater domain. Specifically, our approach follows a two-stage pipeline: (i) given in-air RGB-D spherical pairs from the Stanford 2D-3D-S dataset [20], we train a style-transfer network [21] to convert in-air images to the underwater domain; (ii) given the generated underwater images and their depth maps, we train a depth estimation network which is specially designed for spherical images. During testing, we can generate depth directly from the input image. Our approach is unsupervised in that only underwater images (i.e., no ground truth underwater depth) are required for the whole training process.

Following our preliminary work [22], the main contributions of our paper are as follows:

(i) To the best of our knowledge, we are the first group to employ CycleGAN on spherical underwater images
(ii) This is also the first method to employ deep learning to estimate depth in spherical underwater images
(iii) We provide a spherical underwater dataset, which consists of 3,000 high-quality images from the Great Barrier Reef
(iv) We provide a benchmark of the proposed network with respect to handcrafted images

2. Related Work

2.1. Unsupervised Depth Learning. Learning-based methods for depth estimation are popular. However, for adversarial environments, such as underwater or forest scenarios, annotated data is difficult to obtain. Therefore, supervised learning has difficulties in achieving a good performance in the absence of a large amount of labeled data. Unsupervised learning and self-supervised learning are two ways of utilizing unlabeled data in the learning process. One reason for using unlabeled data is that producing a dataset with clean labels is expensive, while unlabeled data is being generated all the time; the motivation is to make use of the much larger amount of unlabeled data. The main idea of self-supervised learning is to generate labels from unlabeled data, according to the structure or characteristics of the data itself, and then train on this data in a supervised manner. Self-supervised learning is widely used in representation learning to make a model learn the latent features of the data. These methods are widely used in computer vision [23–27], video processing [28, 29], and robot control [30–32].

There is much previous work related to self-supervised depth estimation. In 2017, [33] proposed the monodepth framework, which exploits epipolar geometry constraints and a novel training loss to train the model in a self-supervised way. After that, several methods used geometry constraints to achieve self-supervision. The study in [34] utilized epipolar geometry constraints to estimate both depth and surface normals. The study in [35] investigated the multimodality depth completion task with a self-supervised method by constructing a loss function with photometric constraints, and their method achieved the state of the art (SOTA) on the KITTI depth completion benchmark. The study in [36] exploited the bilateral cyclic relationship between stereo disparities and proposed an adaptive regularization scheme to handle covisible and occluded regions in a stereo pair.

Different from methods based on geometric constraints, some approaches try to exploit the constraints between different modalities, using warping-based methods. The study in [37] proposed a warping-based method to estimate both depth and pose. They designed a loss based on warping nearby views to the target view using the computed depth and pose. The study in [38] proposed monodepth2 to combine depth and camera pose with geometry constraints. To improve the robustness of the model, they also proposed the minimum reprojection loss and utilized a multiscale sampling method in their framework. Currently, monodepth2 achieves SOTA results on the KITTI benchmark. Because these methods can predict both depth and camera pose, they are widely used in robotics and self-driving cars as visual odometry (VO) systems. Zhan et al. investigated end-to-end unsupervised depth-VO [39] and also integrated the depth with the Perspective-n-Point (PnP) method to achieve high robustness [40].

This idea was also extended to combine more computer vision tasks. The study in [41] exploited the content consistency between depth and semantic information. The study in [42] proposed GeoNet to utilize the geometric relationships between depth, optical flow, and camera pose, using an unsupervised learning framework to predict them. The study in [43] proposed a competitive collaboration framework to predict depth, pose, optical flow, and motion segmentation in parallel with an unsupervised method.

Currently, unsupervised depth estimation is successful in indoor or urban scenarios, but there are still few applications in adversarial scenarios. The study in [44] proposed a generative model and exploited cycle-consistent constraints to train the model in an unsupervised fashion. Their method achieves the SOTA on their dataset, but it is hard to apply in real underwater applications, and the amount of available data is also not enough for training.
2.2. Underwater Depth Estimation and Color Correction. In contrast to on-land scenarios, underwater depth estimation is more challenging due to scattering and absorption effects [9, 45], as mentioned above. For that reason, several methods jointly optimize depth estimation and color correction: accurate depth helps restore image colors, and depth can in turn be estimated from the information in the color distortion. For example, the authors of [9, 46] presented an image formation model to estimate depth from image blurriness. In [5], a dark channel prior is used for underwater depth estimation and image restoration to remove attenuation and backscattering effects. The study in [47] presented adaptive image dehazing based on the depth information.

As introduced in Section 2.1 (Unsupervised Depth Learning), there are many successful learning methods to estimate depth for in-air images. Thus, a naive way to estimate underwater depth is to restore underwater images to in-air style so that this depth learning strategy can be applied. In [48], such a strategy proves to be efficient for underwater depth estimation. Both deep learning and mathematical methods are very popular for image restoration. In [49], the Jaffe-McGlamery model [50, 51], a mathematical method, is used to reduce the absorption and scattering effects based on irradiance and depth. In [52], a learning-based method was proposed to solve depth estimation and color correction in the spherical domain at the same time by enforcing left-right consistency under a multicamera setting. However, deep learning usually requires a large amount of data, which is not available for the underwater field. To overcome this problem, the study in [4] proposed a generative adversarial network to generate synthetic underwater images from in-air datasets.

Our work is inspired by WaterGAN [4], but also different from it. WaterGAN requires depth as input to simulate the attenuation and scattering effects, while our underwater GAN only needs underwater and in-air images as input. Our preliminary work is reported in [22], where we proposed a two-stage pipeline to solve underwater omnidirectional depth estimation. In the first, perspective-image pipeline, WaterGAN [4] was used to transfer RGB-D images to underwater RGB-D images. Then, a fully convolutional residual network (FCRN) [53] was trained for depth estimation with the underwater images as input. In the second, omnidirectional stage, we synthesized underwater equirectangular images from in-air equirectangular images by decreasing the values in the red channel (since its long wavelength is absorbed quickly in the underwater environment) and blurring the image based on its distance to the camera origin. Finally, inspired by [54], a distortion-aware convolution module replaced the normal convolution in the FCRN based on the spherical longitude-latitude mapping. In this work, we replace the simple operations on the red channel with a learning method to generate synthetic underwater omnidirectional images. In addition, we improve the method to estimate underwater depth. Finally, we evaluate the results of our algorithm more thoroughly, by estimating ground truth depths for distinctive feature points. In [54], the FCRN [53] was identified as the state-of-the-art (SOTA) network for omnidirectional CNNs, and we thus adopt it and compare to it in this paper.

We want to emphasize that, in general, depth estimation from a single RGB image is a very challenging problem. As our experiments later will show, our approach does not give very accurate estimates, and neither do the other depth estimation approaches mentioned in this section. Also, as with any monocular vision problem, our results are up to an unknown scale factor. Nevertheless, we believe this work to be worthwhile because it paves a path towards potentially more successful approaches (see the future work) and, even though it is not very accurate, has potential use cases, for example in navigation or color correction.

3. Methodology

Figure 1 demonstrates our two-stage pipeline: (i) given in-air RGB-D spherical pairs from the Stanford 2D-3D-S dataset [20], we train CycleGAN [21] to convert in-air images to the underwater domain; (ii) given the generated underwater images and their depth maps, we train a depth estimation network to learn depth. In the following, we introduce the two parts separately.

Figure 1: Full pipeline of our approach. We propose to leverage publicly available RGB-D datasets for style transfer and depth estimation in an unsupervised approach. (Training: in-air RGB image, converted by CycleGAN into a synthetic underwater image, is used together with the corresponding in-air depth to train the fully convolutional residual network with the new loss function. Testing: a test image is passed through the FCRN to obtain the depth prediction.)

3.1. Style Transfer. Generative adversarial nets (GANs) were designed for data augmentation and are now widely used in style-transfer tasks. A GAN is a two-player mini-max game between a generative model G and a discriminative model D [55]. The value function of this adversarial process is

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],  (1)

where p_{data} denotes the data distribution and p_z is a random noise distribution. This value function is also the loss function for the deep neural network.

The underwater style-transfer algorithm CycleGAN [21] consists of two networks, a network G for the forward mapping and a network F for the inverse mapping. Given input images, network G converts them to the target domain and network F converts them back to the original domain. A cycle consistency is enforced as F(G(X)) ≈ X and vice versa, to ensure that the mappings are well constrained. Thus, the loss function of the forward mapping G: X → Y is

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))].  (2)
We use X as the input domain of discriminator D_X and Y as the input domain of discriminator D_Y. Examples of our input images from the two domains are shown in Figures 2 and 3. Since both our input and output operate in the spherical domain, we adopt the network directly, with no modification to the convolution operators.

Moreover, CycleGAN applies the idea of cycle consistency, which is y → F(y) → G(F(y)) ≈ y. The loss function for this step is

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1].  (3)

Finally, the full objective for CycleGAN is

\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F),
G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y).  (4)

Because the method works pixel-to-pixel, the dataset is preprocessed by resizing the images to a reasonable size.

Figure 2: A typical underwater omnidirectional image.

Figure 3: Generated images with our CycleGAN. (a) On the left are examples from Domain I (in-air). (b) On the right are our generated images. We are able to reproduce the lighting and color effects of the original underwater dataset.

Compared with WaterGAN, CycleGAN only needs underwater and in-air images as input, whereas WaterGAN requires depth as input to simulate the attenuation and scattering effects.
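To make the full objective in (4) concrete, the following is a minimal PyTorch-style sketch of how the adversarial and cycle-consistency terms can be combined on the generator side of one training step. This is not the authors' implementation: the generator and discriminator modules (G_xy, F_yx, D_X, D_Y) and the weighting lambda_cyc = 10.0 are illustrative assumptions, and published CycleGAN implementations typically use a least-squares GAN loss rather than the log loss of (1)-(2).

```python
import torch
import torch.nn.functional as nnf

def cyclegan_generator_loss(G_xy, F_yx, D_X, D_Y, x, y, lambda_cyc=10.0):
    """Generator-side loss of a CycleGAN-style objective (illustrative sketch).

    x: batch of in-air equirectangular images, y: batch of real underwater images.
    G_xy maps in-air -> underwater, F_yx maps underwater -> in-air.
    """
    fake_y = G_xy(x)                      # synthetic underwater image
    fake_x = F_yx(y)                      # synthetic in-air image

    # Adversarial terms: the generators try to make the discriminators say "real".
    logits_y = D_Y(fake_y)
    logits_x = D_X(fake_x)
    adv = nnf.binary_cross_entropy_with_logits(logits_y, torch.ones_like(logits_y)) \
        + nnf.binary_cross_entropy_with_logits(logits_x, torch.ones_like(logits_x))

    # Cycle-consistency of equation (3): F(G(x)) should recover x, and vice versa.
    cyc = (F_yx(fake_y) - x).abs().mean() + (G_xy(fake_x) - y).abs().mean()

    # Weighted sum as in equation (4), restricted to the generator update.
    return adv + lambda_cyc * cyc
```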
3.2. Depth Estimation. With the recent success of convolutional neural networks, different CNN-based approaches have been proposed to solve the supervised depth estimation task [53, 56]. However, most of these approaches require large amounts of accurate image and ground truth depth pairs, which are currently unavailable in the spherical underwater domain. Instead, we propose to leverage an available in-air spherical dataset, the Stanford 2D-3D-S benchmark [20], and convert it to underwater style with CycleGAN. Specifically, given (X_i, D_i) pairs from the raw Stanford 2D-3D-S benchmark, we first convert X_i to the underwater domain:

\tilde{X}_i = \mathrm{CycleGAN}(X_i),  (5)

where X_i denotes the original in-air image from the dataset, D_i its corresponding depth, and \tilde{X}_i is the converted underwater image. We can then train our network with the converted (\tilde{X}, D) pairs.

Following the recent success of depth estimation in the spherical domain [57], we adopt FCRN, one of the state-of-the-art single models on NYUv2 [53]. The network consists of a feature extraction model followed by several up-convolution layers that increase the resolution. A UNet [58] is used as the backbone in all our experiments. Finally, the L1 difference is calculated between the output depth and the ground truth depth map:

\mathcal{L}_{depth} = \sum_{d \in x, y} \|D_{pred} - D_{gt}\|,  (6)

where D_{pred} denotes the prediction of the network, D_{gt} denotes the ground truth depth map, and x, y enumerate all the pixels in the input image.

Smoothness regularization has been used frequently for depth estimation in planar images in previous research [33, 38] to encourage the estimated depths to be locally similar. For depth estimation in perspective images, the term is defined as follows:

\mathcal{L}_{sm} = \sum_{p_t} \sum_{d \in x, y} \left| \nabla_d D_t(p_t) \right|,  (7)

where \mathcal{L}_{sm} is a smoothness term that penalizes the L1 norm of first-order depth gradients along both the x and y directions in 2D space.

The equirectangular projection of a 360° image, however, is distorted, and directly using the above depth smoothness term implicitly imposes larger weights on point pairs at larger latitudes. Simply plugging the loss designed for perspective images into the training process might therefore lead to suboptimal results, because the equirectangular projection of spherical images oversamples the image in the polar regions. Taking inspiration from recent work on learning in the spherical domain [59], we propose to weight the distance of two points based on their spherical distance, after which we arrive at the following spherical depth smoothness regularizer:

\mathcal{L}_{sm}^{sph} = \sum_{\theta=0, \phi=0}^{\Theta, \Phi} \omega_{\theta,\phi} \left| \nabla_d D_t(p_t) \right|,  (8)

where \omega_{\theta,\phi} are the weights for each point and \omega_{\theta,\phi} \propto \Omega(\theta, \phi). \Omega(\theta, \phi) is the solid angle corresponding to the sampled area on the depth map located at (\theta, \phi). \mathcal{L}_{sm}^{sph} is a spatial smoothness term that penalizes the L1 norm of second-order depth gradients along both the \theta and \phi directions in 2D space.

Our final loss is a weighted combination of the above factors with \lambda_1 as the weighting factor:

\mathcal{L} = \mathcal{L}_{depth} + \lambda_1 \mathcal{L}_{sm}^{sph}.  (9)
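As a concrete illustration of the solid-angle weighting in (8), the sketch below builds per-row weights proportional to sin(θ) (the solid angle of an equirectangular cell scales with the sine of the polar angle) and applies them to L1 depth gradients. This is only one plausible reading of the regularizer, not the authors' code; the tensor layout (batch, 1, H, W) and the use of adjacent-pixel gradients are assumptions made for the example.

```python
import math
import torch

def spherical_smoothness_loss(depth):
    """Solid-angle-weighted depth smoothness for equirectangular maps (sketch).

    depth: tensor of shape (B, 1, H, W), rows spanning polar angle theta in (0, pi).
    """
    b, _, h, w = depth.shape

    # Solid angle of an equirectangular cell is proportional to sin(theta);
    # rows near the poles therefore receive small weights.
    theta = (torch.arange(h, dtype=depth.dtype, device=depth.device) + 0.5) * math.pi / h
    weight = torch.sin(theta).view(1, 1, h, 1)

    # L1 norm of depth gradients along the phi (width) and theta (height) directions.
    grad_phi = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    grad_theta = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()

    # Weighted means approximate the sum over (theta, phi) in equation (8).
    return (weight * grad_phi).mean() + (weight[:, :, 1:, :] * grad_theta).mean()
```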
4. Experimental Details

We evaluate our approach with two experiments. Firstly, we use the synthetic underwater Stanford 2D-3D-S dataset with exact ground truth to quantitatively evaluate the algorithm. Here, we also compare to the SOTA algorithm for in-air spherical images, FCRN [53], in two setups: we test FCRN with the synthetic (GAN) images as well as with the original RGB images as input. All algorithms are trained using the synthetic underwater images. The second experiment uses real omnidirectional underwater images and sparse ground truth points estimated via bundle adjustment to test the algorithm with in situ data.

In the following, we first introduce the datasets, hyperparameters, and evaluation metrics used in the experiments.

4.1. Datasets. Stanford 2D-3D-S [20] is one of the standard in-air benchmarks. The dataset provides omnidirectional RGB images and corresponding depth information, which is the data necessary for depth estimation training. Furthermore, it also provides semantics in 2D and 3D, a 3D mesh, and surface normals.

In addition, we use a dataset that we collected by scuba diving in the Great Barrier Reef. We use it for training our CycleGAN with original spherical underwater images as well as for testing our approach. This omnidirectional dataset for style transfer and testing was collected with an Insta360 ONE X camera (https://www.insta360.com/product/insta360-onex) at depths between 1 m and 25 m.

To evaluate the final results of our two-stage pipeline, the ground truth depth of the underwater scenario is generated based on epipolar geometry. The generation steps are as follows: firstly, a pair of stereo images with a known baseline is used to estimate sparse map points by feature matching, the five-point algorithm [60], and triangulation [61]. Then, two pairs of stereo images taken at different times, with a large enough spatial disparity and including the pair used for the map points, are used to fine-tune the positions of the map points with bundle adjustment. Finally, the depth of these map points is normalized to 0 to 255 and used as up-to-scale ground truth.

Figure 4: An example of ground truth points. The picture was captured with the Insta360 ONE X camera in a real ocean scenario. Green points represent the interest points, whose depths are calculated by stereopsis.

Figure 4 shows an example of points (green dots) that are used as ground truth. It can be seen that most of these points lie on the reef instead of in open water, because the open water and the surface do not provide feature points. Though only sparse points are generated, we believe that they are sufficient for the evaluation of our depth results. On the underwater dataset used for evaluation, we generate about 100 points for each image.
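The sparse ground-truth generation described above can be prototyped with standard OpenCV building blocks: matched keypoints, essential matrix via the five-point algorithm, pose recovery, and triangulation. The sketch below is a hedged illustration under assumed pinhole intrinsics K and matched pixel arrays pts1/pts2; the paper works on spherical images and additionally refines the points with bundle adjustment over a second stereo pair, which is omitted here.

```python
import cv2
import numpy as np

def sparse_depth_from_stereo(pts1, pts2, K, baseline_scale=1.0):
    """Triangulate sparse map points from two views (illustrative sketch).

    pts1, pts2: (N, 2) float arrays of matched pixel coordinates.
    K: (3, 3) camera intrinsic matrix (pinhole assumption for this sketch).
    Returns per-point distances to the first camera origin, up to the baseline scale.
    """
    # Five-point algorithm with RANSAC to estimate the essential matrix.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

    # Recover relative rotation R and unit-length translation t, keeping inliers.
    _, R, t, inliers = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

    # Projection matrices for the two views; t is scaled by the known baseline.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t * baseline_scale])

    # Triangulate the inlier correspondences and convert from homogeneous coordinates.
    good = inliers.ravel().astype(bool)
    pts4d = cv2.triangulatePoints(P1, P2, pts1[good].T, pts2[good].T)
    pts3d = (pts4d[:3] / pts4d[3]).T

    return np.linalg.norm(pts3d, axis=1)
```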
4.2. Hyperparameters. The hyperparameters for the style transfer include the resolution of the input images, which is set to 512 × 256 pixels. We then train CycleGAN [21] with the following hyperparameters: learning rate 2e-4 and 8 epochs.

We implement the FCRN for depth estimation with the PyTorch framework and train our network with the following hyperparameter settings during pretraining: mini-batch size 8, learning rate 1e-2, momentum 0.9, weight decay 0.0005, and 50 epochs. We gradually reduce the learning rate by a factor of 0.1 every 10 epochs. Finally, we fine-tune the whole network with a learning rate of 1e-4 for another 20 epochs. \lambda_1 is set to 1e-4 in all our experiments.

4.3. Metrics. For our depth estimation network, we adopt FCRN [53] and compare the model with the initial loss function against the model with our new loss function. Apart from these two networks, we also use an FCRN trained on the original in-air images, which are not processed by CycleGAN. For evaluation, we use the following common metrics on the datasets mentioned above: root mean square error (RMS) \sqrt{(1/T) \sum_p (g_p - z_p)^2}, mean relative error (Rel) (1/T) \sum_p (\|g_p - z_p\| / g_p), mean log10 error (log10) (1/T) \sum_p \|\log_{10} g_p - \log_{10} z_p\|, and pixel accuracy as the percentage of pixels with \max(z_i / z_i^{gt}, z_i^{gt} / z_i) < \delta for \delta \in \{1.25, 1.25^2, 1.25^3\}. T denotes the number of pixels, and g_p and z_p represent the ground truth and the depth map prediction, respectively.
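For reference, the standard metrics of Section 4.3 can be computed as in the short NumPy sketch below. This is not the authors' evaluation script, just a direct transcription of the formulas; gt and pred are assumed to be flattened arrays of valid (nonzero) ground-truth and predicted depths.

```python
import numpy as np

def depth_metrics(gt, pred):
    """RMS, Rel, log10, and delta accuracies over valid pixels (sketch of Section 4.3)."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)

    rms = np.sqrt(np.mean((gt - pred) ** 2))
    rel = np.mean(np.abs(gt - pred) / gt)
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))

    # Fraction of pixels whose prediction is within a factor delta of the ground truth.
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]

    return rms, rel, log10, deltas
```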
4.4. Metric for Real Experiment. To evaluate the final results of our two-stage approach, we rely on the sparse ground truth points captured with the approach described in Section 4.1 (Datasets). For all nonzero points, whose positions are denoted by (i, j), we look up the corresponding values in the ground truth and the estimated depth. The result of our estimation is up to an unknown scale factor. We thus minimize the error by calculating the best-fitting scale factor for the ground truth. To do so, we calculate the scale parameter between each pair of ground truth and result values and then take the median. To be more specific, for each point pair there is the ratio of the ground truth value P_{gt}(i, j) to the result value P(i, j). Then, using these ratios for one image, we calculate their median s, which approximates a least-squares optimization, and set the median s as the scale parameter between the ground truth and the result. Finally, we rescale the result and compute the error E for each point. The error E for each image is calculated by

E = Q_{1/2}\left( \frac{|P_{gt}(i, j) - s \cdot P(i, j)|}{P_{gt}(i, j)} \right), \quad \text{if } P_{gt}(i, j) \neq 0,  (10)

where the operation Q_{1/2} computes the median over all ground truth and result point pairs.
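A minimal NumPy sketch of this scale-invariant evaluation is given below: the scale s is the median ratio of ground truth to prediction, and the reported error is the median relative residual of equation (10). Variable names are illustrative; the special case s = 1 for the all-zero "black result" baseline discussed in Section 5.2 is handled explicitly.

```python
import numpy as np

def scale_aligned_median_error(p_gt, p_pred):
    """Median relative error after median-ratio scale alignment (sketch of eq. (10)).

    p_gt, p_pred: 1D arrays of sparse ground-truth and predicted depths at the
    same (i, j) locations; only nonzero ground-truth points are used.
    """
    valid = p_gt > 0
    gt, pred = p_gt[valid], p_pred[valid]

    # Best-fitting scale factor: median of per-point ratios. Falls back to 1
    # when the prediction is all zeros, e.g. the "black result" baseline.
    ratios = gt[pred > 0] / pred[pred > 0]
    s = np.median(ratios) if ratios.size > 0 else 1.0

    return np.median(np.abs(gt - s * pred) / gt)
```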
5. Results

In this section, we demonstrate the results on the converted Stanford 2D-3D-S dataset and on real underwater images collected in the Great Barrier Reef.

5.1. Evaluation of Synthetic Images. Since there are few underwater datasets with ground truth depth, we synthesize underwater-style images from the Stanford 2D-3D-S dataset. CycleGAN [21] is used to generate the synthetic underwater images in this work. Figure 3 shows several examples of the synthetic images. It can be seen that the generated images successfully transfer the in-air images to the underwater style, especially with respect to color.

One interesting phenomenon during the transfer is that, if we train the style-transfer network for many epochs, a lot of unnecessary and unreasonable features are also learned. However, in most cases, we just need to transfer some specific features, like color. The testing on our own underwater dataset revealed that the estimation results for some water-only parts are not accurate enough. This may also be due to the fact that indoor scenarios are too different from the underwater domain.

Figure 5 presents the results of the estimated depth on the synthetic underwater Stanford 2D-3D-S dataset, where brighter pixels represent a larger depth and darker pixels are closer. It can be seen that the estimated depths on the right of Figure 5, corresponding to the input images on the left, are acceptable, especially in the farther areas. Additionally, Table 1 gives a more rigorous evaluation of the results. Compared to the classic FCRN network, our improved loss function gives slightly better results, as indicated by the smaller RMS, Rel, and log10. It can also be seen from the FCRN RGB experiment that using RGB images for training the SOTA network gives far worse results compared to ours and also to FCRN trained with GAN images. Because the style-transferred images mainly imitate the color information, the network can still be used to estimate the depth information from these images.

Figure 5: Generated depth from the style-transferred underwater Stanford 2D-3D-S dataset. (a) On the left are the input images. (b) On the right are the corresponding predicted depth maps.

Table 1: Performance comparison on 1412 images from the Stanford 2D-3D-S dataset. All tests use images transformed with GAN as input. Our approach and FCRN GAN were trained with synthetic images, while FCRN RGB uses, for comparison, RGB images as training data. The metrics are explained in Section 4.3; the arrows indicate whether smaller (↓) or bigger (↑) values are better.

Methods              RMS (m) ↓   Rel (m) ↓   log10 ↓   δ < 1.25 ↑   δ < 1.25^2 ↑   δ < 1.25^3 ↑
Ours: + L_grad^sph   0.683       0.177       0.075     0.744        0.919          0.972
FCRN GAN             0.687       0.181       0.078     0.737        0.920          0.972
FCRN RGB             1.281       0.327       0.181     0.387        0.648          0.801

5.2. Evaluation of Real Underwater Images. After achieving acceptable results on the synthetic dataset, we also evaluate the results on real underwater images. Note that we cannot compare to any other method here, since, to the best of our knowledge, we are the first to propose an algorithm for depth estimation on spherical underwater images. Figure 6 demonstrates the estimated depth on our underwater dataset. Similarly, it can be seen that the brighter parts on the right of Figure 6 correspond to areas farther away, which implies that the network at least estimates the depth correctly in some regions.

Because our network is trained on the Stanford 2D-3D-S dataset, in which the original images all lack the upper and lower parts (15.6% of the image height for each part), these parts are filled with pure black pixels. Therefore, the upper and lower parts in the final results of the underwater depth estimation are also not evaluated. In other words, we effectively use panorama images instead of full spherical images.

Figure 6: Generated depth from our underwater dataset. (a) On the left are the input images. (b) On the right are the corresponding predicted depth maps. The upper and lower parts (15.6% of the image height for each part) are not good; the reasons are given in this section.

Though our underwater dataset does not have ground truth depth maps, we can evaluate the results with the sparse map points. We randomly choose 20 images to test with the corresponding ground truth calculated by stereopsis.

According to the metric presented above, the results are shown in the first row of Table 2. Each column shows results averaged over all images: in the first column, we take the median of the errors of all pixels for which we have ground truth in that image; the second column gives the mean of those errors; and the last column shows the standard deviation in each image, each averaged over all images. We can see that the average median error is 22% of the estimated depth, with a mean error of 40% and a standard deviation of 62%. Of course, those values show that the estimated depth is quite inaccurate. Nevertheless, we believe that the estimates are still somewhat useful for certain applications, for example, navigation, colorization, dehazing, or location fingerprinting. Furthermore, we hope that, in the future, those values can be improved, for example by better and more training data and by providing a few consecutive or stereo frames as input.

Table 2: Performance comparison between the ground truth and various results. More details are shown in the Supplementary Material.

Results types              Average median error   Average mean error   Average standard deviation
Ours                       0.22                   0.40                 0.62
FCRN (trained with RGB)    0.30                   3.76                 7.16
Black result               1.00                   1.00                 0.00
White result               0.95                   1.10                 0.65
Random noise result        0.96                   2.83                 3.31
Gray-scale result          0.95                   1.10                 7.12
Black input                0.27                   3.75                 7.18
White input                0.31                   3.70                 6.91
Random noise input         0.32                   3.77                 7.00
Gray-scale input           0.24                   0.51                 1.26

In order to better understand the properties of our approach and to put the evaluation results for our method into perspective, we use the same test frames to compare against several other cases. The second row in Table 2 shows the results of the original FCRN, trained with the normal RGB images from Stanford 2D-3D-S. When testing this network with our real underwater data, we see that the average mean error and the average standard deviation are very big compared to our proposed approach. This shows that using the CycleGAN synthetic images during training is very advantageous. Even though this does not prove that CycleGAN provides a very realistic underwater transfer, it is a strong indication in that direction.

The other cases we show in Table 2 aim to demonstrate that our approach is indeed doing something useful and not just producing random values. Firstly, we construct four different fake depth results for comparison: the "black result" depth image is all black (0 distance), the "white result" depth image is all white, and the "random noise result" depth image has random distances. Finally, there is also a depth image called "gray-scale result," which is simply the input underwater image in grey scale. Please note that, in the "black result" case, the image contains only 0's, so the scale parameter s cannot be obtained with the metric presented above. However, any scale acting on 0 leaves it at 0. Thus, we adapt the metric for this specific case by setting the scale parameter s = 1. The error in that case is then always 1, and thus the standard deviation is 0. We can see that the evaluations of all those fake results are much worse than our result.

Secondly, we used the same data as above (black, white, random noise, and gray-scale images) as the input to our approach. This can be regarded as a test of whether the network is overfitting too much: generating good results on meaningless data would be a clear indication of overfitting, for example, because the training data is not diverse enough. We can see that the average median error is in the range of our result. We think this is due to two reasons: (i) provided with meaningless data, the network seems to generate depth images that somewhat resemble typical depth images; thus, it might be overfitting a bit. (ii) The rescaling process of our evaluation optimizes the generated depth maps such that they best fit the ground truth (of the underwater image that is not being used here). The median error against that ground truth may be quite small for those "typical" depth images generated from meaningless data. But looking at the average mean error and standard deviation, we see that those generated depth maps have a very big error, thus showing that our result is clearly much better.

In the last row, we use the gray-scale version of the color frame as the input. As could be expected, this gives reasonable, second-best results. Nevertheless, it is still worse than the color input, so the color information seems to be important. Comparing the result of our method to all other tests, we see that the average median error, average mean error, and average standard deviation are much better for our approach, clearly showing that our approach does work to a certain extent.
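The handcrafted baselines in Table 2 are simple to reproduce. The sketch below builds the four fake inputs/results (black, white, random noise, gray-scale) for an RGB underwater frame; the array shapes and the 8-bit value range are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def handcrafted_baselines(rgb):
    """Build the black / white / random-noise / gray-scale baselines (sketch).

    rgb: uint8 array of shape (H, W, 3) holding an underwater frame.
    Returns a dict of single-channel uint8 images of shape (H, W).
    """
    h, w, _ = rgb.shape
    gray = np.dot(rgb[..., :3], [0.299, 0.587, 0.114]).astype(np.uint8)  # luminance

    return {
        "black": np.zeros((h, w), dtype=np.uint8),            # all-zero "depth"
        "white": np.full((h, w), 255, dtype=np.uint8),         # all-max "depth"
        "random_noise": np.random.randint(0, 256, size=(h, w), dtype=np.uint8),
        "gray_scale": gray,                                     # the frame itself in grey
    }
```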
6. Conclusions

This paper presented a supervised depth learning method for underwater spherical images. Firstly, we implemented style transfer based on CycleGAN to synthesize underwater images. The results show that CycleGAN learned the features of underwater scenarios and synthesizes convincing images in the underwater style. Those images are then used to train a second network, a fully convolutional residual network (FCRN), for underwater spherical depth estimation. That network is trained in a supervised manner. Our first experiment used the synthetic images from CycleGAN for evaluation and comparison with FCRN. Furthermore, we tested our method on real underwater data from the Great Barrier Reef, for which we estimated sparse ground truth depth points using stereopsis and bundle adjustment. We also compared our results to artificial input and output data, to show that the network is indeed performing depth estimation. The experiments demonstrated that the style transfer, as well as the depth estimation results, is convincing. Our method achieves better results than training without the GAN. It also achieves slightly better results than FCRN trained with GAN images, so our updated loss function is beneficial. The experiments further showed that the estimated depth on real underwater images is somewhat reasonable and better than all other methods and options we compared to.

Nevertheless, the approach is far from perfect, especially regarding the accuracy of the estimated depth. This is mainly due to the fact that estimating depth from a single image is a very challenging task. Our approach is also not very general: the underwater dataset was taken at only one location with very good visibility, and there are many more underwater scenarios with differing styles, so more underwater training data is needed. In the future, we plan to work on a unified approach that can work in all kinds of different underwater situations. In addition, for testing in real underwater environments, we also plan to mask water-only areas with a segmentation process. Collecting an in-air dataset with depth that looks closer to the underwater images, for example canyons or deserts, might also further improve our performance. Since the underwater data we collected also contains spherical videos from two more cameras, we will investigate using this stereo data for depth training. Furthermore, more complicated network structures that take previous frames into account may provide even better results.

Data Availability

The images of the underwater dataset, including the data for the ground truth evaluation, can be found at https://robotics.shanghaitech.edu.cn/static/datasets/underwater/UW_omni.tar.gz (780 MB).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Supplementary Materials

Tables S1, S2, and S3 show the median, mean, and standard deviation of the error between the ground truth and the results estimated with different methods. The column "ours" is the result estimated by the proposed method. The "gray-scale" input is converted from the input RGB image. The remaining inputs, "random noise," "white," and "black," are generated manually. The columns labeled "result" are calculated by comparing the ground truth and the image directly, whereas those labeled "input" are computed by first taking the image as the input of the proposed network and then comparing the output with the ground truth. "Ours without GAN" denotes the result of the model trained on the original in-air dataset, without CycleGAN. In addition, "gt size" is the number of points provided by the ground truth. (Supplementary Materials)
References

[1] A. Gomez Chavez, Q. Xu, C. A. Mueller, S. Schwertfeger, and A. Birk, "Adaptive navigation scheme for optimal deep-sea localization using multimodal perception cues," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 2019.
[2] J. Yuh and M. West, "Underwater robotics," Advanced Robotics, vol. 15, no. 5, pp. 609–639, 2001.
[3] C. Beall, B. J. Lawrence, V. Ila, and F. Dellaert, "3d reconstruction of underwater structures," in Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4418–4423, IEEE, Taipei, Taiwan, September 2010.
[4] J. Li, K. A. Skinner, R. M. Eustice, and M. Johnson-Roberson, "WaterGAN: unsupervised generative network to enable real-time color correction of monocular underwater images," IEEE Robotics and Automation Letters (RA-L), pp. 387–394, 2017.
[5] P. L. J. Drews, E. R. Nascimento, S. S. C. Botelho, and M. F. Montenegro Campos, "Underwater depth estimation and image restoration based on single images," IEEE Computer Graphics and Applications, vol. 36, no. 2, pp. 24–35, 2016.
[6] T. Łuczyński and A. Birk, "Underwater image haze removal with an underwater-ready dark channel prior," in OCEANS 2017, pp. 1–6, IEEE, Anchorage, AK, USA, September 2017.
[7] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, 2011.
[8] M. Pfingsthorn, A. Birk, S. Schwertfeger, H. Bülow, and K. Pathak, "Maximum likelihood mapping with spectral image registration," in Proceedings of the 2010 IEEE International Conference on Robotics and Automation, pp. 4282–4287, Anchorage, AK, USA, May 2010.
[9] Y.-T. Peng, X. Zhao, and P. C. Cosman, "Single underwater image enhancement using depth estimation based on blurriness," in Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), pp. 4952–4956, Quebec, Canada, September 2015.
[10] P. Anandan, S. Gagliano, and M. Bucolo, "Computational models in microfluidic bubble logic," Microfluidics and Nanofluidics, vol. 18, no. 2, pp. 305–321, 2015.
[11] F. Cairone, P. Anandan, and M. Bucolo, "Nonlinear systems synchronization for modeling two-phase microfluidics flows," Nonlinear Dynamics, vol. 92, no. 1, pp. 75–84, 2018.
[12] A. A. Argyros, K. E. Bekris, S. C. Orphanoudakis, and L. E. Kavraki, "Robot homing by exploiting panoramic vision," Autonomous Robots, vol. 19, no. 1, pp. 7–25, 2005.
[13] R. Benosman, S. Kang, and O. Faugeras, Panoramic Vision, Springer-Verlag New York, Berlin, Germany, 2000.
[14] H. Kuang, Q. Xu, X. Long, and S. Schwertfeger, "Pose estimation for omni-directional cameras using sinusoid fitting," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 2019.
[15] T. Lemaire and S. Lacroix, "Slam with panoramic vision," Journal of Field Robotics, vol. 24, no. 1-2, pp. 91–111, 2007.
[16] Q. Xu, A. Gomez Chavez, H. Bülow, A. Birk, and S. Schwertfeger, "Improved fourier mellin invariant for robust rotation estimation with omni-cameras," in Proceedings of the 2019 26th IEEE International Conference on Image Processing, IEEE, Taipei, Taiwan, September 2019.
[17] B. Terry, "Dove: dolphin omni-directional video equipment," in Proceedings of the International Conference on Robotics and Automation, pp. 214–220, Paris, France, May 2000.
[18] J. Bosch, N. Gracias, P. Ridao, and D. Ribas, "Omnidirectional underwater camera design and calibration," Sensors, vol. 15, no. 3, pp. 6033–6065, 2015.
[19] F. Bruno, G. Bianco, M. Muzzupappa, S. Barone, and A. V. Razionale, "Experimentation of structured light and stereo vision for underwater 3d reconstruction," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 66, no. 4, pp. 508–518, 2011.
[20] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, "Joint 2d-3d-semantic data for indoor scene understanding," 2017, https://arxiv.org/abs/1702.01105.
[21] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[22] H. Kuang, Q. Xu, and S. Schwertfeger, "Depth estimation on underwater omni-directional images using a deep neural network," 2019, https://arxiv.org/abs/1905.09441.
[23] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, Santiago, Chile, December 2015.
[24] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," 2016, https://arxiv.org/abs/1605.09782.
[25] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.
[26] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in Proceedings of the International Conference on Learning Representations, Vancouver, Canada, April 2018.
[27] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proceedings of the European Conference on Computer Vision, pp. 649–666, Amsterdam, Netherlands, October 2016.
[28] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 391–408, Munich, Germany, September 2018.
[29] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, Santiago, Chile, December 2015.
[30] E. Jang, C. Devin, V. Vincent, and S. Levine, "Grasp2vec: learning object representations from self-supervised grasping," in Proceedings of the Conference on Robot Learning, Zurich, Switzerland, October 2018.
[31] A. Nair, S. Bahl, K. Alexander, P. Vitchyr, G. Berseth, and S. Levine, "Contextual imagined goals for self-supervised robotic learning," in Proceedings of the Conference on Robot Learning, Osaka, Japan, October 2019.
[32] X. Zhi, X. He, and S. Schwertfeger, "Learning autonomous exploration and mapping with semantic vision," in Proceedings of the International Conference on Image, Video and Signal Processing (IVSP), Shanghai, China, February 2019.
[33] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279, Honolulu, HI, USA, July 2017.
[34] H. Zhan, C. S. Weerasekera, R. Garg, and I. Reid, "Self-supervised learning for single view depth and surface normal estimation," in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), pp. 4811–4817, Montreal, Canada, May 2019.
[35] F. Ma, G. V. Cavalheiro, and S. Karaman, "Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera," in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), pp. 3288–3295, Montreal, Canada, May 2019.
[36] A. Wong and S. Soatto, "Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5644–5653, Long Beach, CA, USA, June 2019.
[37] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858, Honolulu, HI, USA, July 2017.
[38] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838, Seoul, Korea, November 2019.
[39] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid, "Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349, Long Beach, CA, USA, June 2019.
[40] H. Zhan, C. S. Weerasekera, J. Bian, and I. Reid, "Visual odometry revisited: what should be learnt?," 2019, https://arxiv.org/abs/1909.09803.
[41] P.-Y. Chen, H. Alexander, Y.-C. Liu, and Y.-C. F. Wang, "Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2624–2632, Long Beach, CA, USA, June 2019.
[42] Z. Yin and J. Shi, "Geonet: unsupervised learning of dense depth, optical flow and camera pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992, Long Beach, CA, USA, June 2019.
[43] A. Ranjan, V. Jampani, L. Balles et al., "Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12240–12249, Long Beach, CA, USA, June 2019.
[44] H. Gupta and K. Mitra, "Unsupervised single image underwater depth estimation," in Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), pp. 624–628, Taipei, Taiwan, September 2019.
[45] D. Paul, E. Nascimento, F. Moraes, S. Botelho, and M. Campos, "Transmission estimation in underwater single images," in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 825–830, Sydney, Australia, April 2013.
[46] Y.-T. Peng and P. C. Cosman, "Underwater image restoration based on image blurriness and light absorption," IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1579–1594, 2017.
[47] X. Ding, Y. Wang, J. Zhang, and X. Fu, "Underwater image dehaze using scene depth estimation with adaptive color correction," in OCEANS 2017, pp. 1–5, Aberdeen, Scotland, June 2017.
[48] C. O. Ancuti, C. Ancuti, C. De Vleeschouwer, L. Neumann, and R. Garcia, "Color transfer for underwater dehazing and depth estimation," in Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), pp. 695–699, Beijing, China, September 2017.
[49] K. A. Skinner, E. Iscar, and M. Johnson-Roberson, "Automatic color correction for 3d reconstruction of underwater scenes," in Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5140–5147, Singapore, May 2017.
[50] J. S. Jaffe, "Computer modeling and the design of optimal underwater imaging systems," IEEE Journal of Oceanic Engineering, vol. 15, no. 2, pp. 101–111, 1990.
[51] B. L. McGlamery, "Computer analysis and simulation of underwater camera system performance," SIO Reference, vol. 75, no. 2, 1975.
[52] K. A. Skinner, J. Zhang, E. A. Olson, and M. Johnson-Roberson, "UWStereoNet: unsupervised learning for depth estimation and color correction of underwater stereo imagery," in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), pp. 7947–7954, Montreal, Canada, May 2019.
[53] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," in Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248, Stanford, California, October 2016.
[54] K. Tateno, N. Navab, and F. Tombari, "Distortion-aware convolutional filters for dense prediction in panoramic images," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 707–722, Munich, Germany, September 2018.
[55] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[56] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 2366–2374, 2014.
[57] L. Jin, Y. Xu, Z. Jia et al., "Geometric structure based and regularized depth estimation from 360 indoor imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 889–898, Seattle, WA, USA, June 2020.
[58] O. Ronneberger, P. Fischer, and T. Brox, "U-net: convolutional networks for biomedical image segmentation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Munich, Germany, October 2015.
[59] Z. Zhang, Y. Xu, J. Yu, and S. Gao, "Saliency detection in 360 videos," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 488–503, Munich, Germany, September 2018.
[60] H. Stewenius, D. Nister, F. Kahl, and F. Schaffalitzky, "A minimal solution for relative pose with unknown focal length," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, pp. 789–794, San Diego, California, June 2005.
[61] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, UK, 2003.
