COMPUTER ASSISTED SURGERY, 2019, VOL. 24, NO. S1, 30–35. https://doi.org/10.1080/24699322.2018.1557889

RESEARCH ARTICLE

Ke Xu, Zhiyong Chen and Fucang Jia

a School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China; b Shenzhen Key Laboratory of Minimally Invasive Surgical Robotics and System, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

CONTACT: Fucang Jia, fc.jia@siat.ac.cn

© 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT: Minimally invasive laparoscopic surgery is associated with small wounds, short recovery time, and fewer postoperative infections. Traditional two-dimensional (2D) laparoscopic imaging lacks depth perception and does not provide quantitative depth information, limiting the field of view and the surgeon's ability to operate. Three-dimensional (3D) laparoscopic imaging built from 2D images gives surgeons a sense of depth; however, this depth information is not quantitative and cannot be used directly for robotic surgery. Therefore, this study aimed to reconstruct an accurate depth map for binocular 3D laparoscopy. An unsupervised learning method was proposed to compute accurate depth when the ground-truth depth is not available. Experimental results showed that the method not only generated accurate depth maps but also ran in real time, so it could be used in minimally invasive robotic surgery.

KEYWORDS: Depth estimation; 3D reconstruction; laparoscopic surgery; unsupervised learning

1. Introduction

Laparoscopic surgery (LS) has many advantages over open surgery, such as less bleeding and faster recovery. LS is now widely used in abdominal surgery, for example for the removal of liver tumors and the resection of uterine fibroids. Surface reconstruction of soft tissue and organs is an important part of minimally invasive surgery. Traditional two-dimensional (2D) laparoscopy has shortcomings in spatial orientation and in the identification of anatomical structures. Three-dimensional (3D) laparoscopy greatly alleviates these shortcomings: it provides surgeons not only with visual depth perception but also with quantitative depth information for surgical navigation and robotic surgery. In binocular stereoscopic 3D imaging, accurate registration of depth maps to abdominal tissue is an important technical component of minimally invasive robot-assisted surgery, and binocular stereo depth estimation has become an active research topic in many countries.

At present, binocular 3D reconstruction methods for soft-tissue surfaces can be roughly divided into three categories: stereo matching, simultaneous localization and mapping (SLAM), and neural networks.

Stereo matching uses feature-point or block matching to perform the matching computation and reconstructs the 3D scene from the matched feature points or blocks. Penza et al. [1] used a modified census transform to measure similarity and find corresponding regions in the left and right images, and optimized the disparity maps with a super-pixel method for 3D reconstruction. Luo et al. [2] compared the color and gradient similarity of the left and right laparoscopic images to find the best-matching feature regions and used bilateral filtering to optimize the disparity map for 3D reconstruction. However, the time complexity of this kind of 3D reconstruction is high, while the accuracy of the depth maps is not.

Most SLAM algorithms achieve inter-frame estimation and loop-closure detection by feature-point matching. For example, Mahmoud et al. [3] proposed an improved parallel tracking and mapping method based on ORB-SLAM to find new key-frame feature points for 3D reconstruction of the porcine liver surface. However, its accuracy was not high.

Laparoscopic 3D reconstruction studies based on neural networks are few, and most studies have focused on natural scenes. Luo et al. [5] transformed natural-scene images into matching blocks for 3D reconstruction. Antal [4] used the feature points of the left and right hepatic surface images, whose intensity values formed a set of 3D coordinates used as inputs, and computed the depth image with a supervised neural network. Zhou et al. [6] jointly trained a monocular disparity prediction network and a camera pose estimation network with an unsupervised convolutional neural network, combining the two into an unsupervised depth prediction network. Garg et al. [7] used the AlexNet architecture [8] to predict monocular depth images and replaced the last fully connected layer with a convolution layer to reduce the number of training parameters. The first two methods are depth prediction networks trained with supervised learning; the latter two use unsupervised learning.

Unsupervised learning is better suited to depth prediction networks for LS because ground-truth depth maps of laparoscopic soft tissue and organs are difficult to obtain.
2. Methods

The experimental data for this study came from the Hamlyn Center Laparoscopic/Endoscopic Video Datasets [9]. In this study, a residual network was used for the first time to predict the depth map of the soft-tissue surface under LS. The method is end-to-end: the input is a pair of calibrated stereo images and the output is the corresponding depth image. An unsupervised binocular dense depth estimation network was trained on unlabeled, calibrated laparoscopic binocular stereo image sequences. At test time, the predicted depth image is generated directly when calibrated image pairs are fed to the trained model.

Figure 1. Unsupervised binocular depth estimation network.

2.1. Binocular depth estimation network

A nonlinear auto-encoder model was trained to estimate the depth map corresponding to a pair of RGB images. The flowchart of the unsupervised binocular depth estimation network is illustrated in Figure 1. Given the calibrated stereo image pair I_L and I_R as input to the auto-encoder network, the corresponding disparity maps (inverse depth) D_L and D_R are computed. A spatial transformer network (STN) [10] is then used for bilinear sampling with D_L (D_R) to generate the reconstructed images Ĩ_L (Ĩ_R). In Figure 1, the image reconstruction process is drawn with solid lines and the construction of the loss function with dashed lines.
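To make the bilinear sampling step concrete, the following is a minimal TensorFlow 2.x sketch of an STN-style horizontal warp that reconstructs the left image from the right image and the predicted left disparity. The function name, tensor layout, and disparity sign convention are illustrative assumptions; the paper itself relies on the spatial transformer network of [10].

```python
import tensorflow as tf

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left view by bilinear sampling of the right image at
    column x + D_L(x, y) on each row, following the index convention of the
    left-right consistency loss in Section 2.2 (the sign depends on the
    rectification convention and is an assumption here).

    img_right: [B, H, W, C] image tensor; disp_left: [B, H, W, 1] disparity in pixels.
    """
    b, h, w, _ = tf.unstack(tf.shape(img_right))
    xs = tf.cast(tf.range(w), tf.float32)                    # column indices [W]
    xs = tf.tile(xs[tf.newaxis, tf.newaxis, :], [b, h, 1])   # [B, H, W]
    x_src = xs + tf.squeeze(disp_left, axis=-1)              # sampling coordinates
    x0 = tf.floor(x_src)
    frac = (x_src - x0)[..., tf.newaxis]                     # bilinear weight
    x0 = tf.clip_by_value(tf.cast(x0, tf.int32), 0, w - 1)
    x1 = tf.clip_by_value(x0 + 1, 0, w - 1)
    # Gather the two horizontal neighbours and blend them linearly.
    left0 = tf.gather(img_right, x0, axis=2, batch_dims=2)
    left1 = tf.gather(img_right, x1, axis=2, batch_dims=2)
    return left0 * (1.0 - frac) + left1 * frac
```

The same sampler, applied with the right disparity, produces the reconstructed right image used by the right-hand loss terms.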
The auto-encoder network comprises two parts, an encoder and a decoder. The encoder was inspired by the methods described in previous studies [11–13]. The deeper bottleneck architecture [14] was adopted for the ResNet101 encoder, and the final fully connected layer was removed to reduce the number of parameters. The encoder architecture is summarized in Table 1. The decoder uses a multiscale architecture with skip ("plus") connections [15]. The disparity acquisition layers follow the method of previous studies [6, 9], and a sigmoid activation function is applied in the disparity convolution layers to obtain the depth image.

Table 1. Encoder and decoder part.

Encoder (ResNet101)
Layer | In/Out/K/S | Number | Output size
Conv1 | 3/64/7/2 | 1 | 128 × 64
Pool | –/–/3/2 | 1 | 64 × 32
Conv2_x: Conv2_1 64/64/1/1; Conv2_2 64/64/3/–; Conv2_3 64/256/1/1 | 3 | 32 × 16
Conv3_x: Conv3_1 256/128/1/1; Conv3_2 128/128/3/–; Conv3_3 128/512/1/1 | 4 | 16 × 8
Conv4_x: Conv4_1 512/256/1/1; Conv4_2 256/256/3/–; Conv4_3 256/1024/1/1 | 23 | 8 × 4
Conv5_x: Conv5_1 1024/512/1/1; Conv5_2 512/512/3/–; Conv5_3 512/2048/1/1 | 3 | 4 × 2

Decoder
Layer | In/Out/K/S | Output size
DeConv6_x: DeConv5_3 2048/512/3/2; Plus6 = DeConv5_3 + Conv4_3, –/1536/–/–; Conv6 (Plus6) 1536/512/3/1 | 8 × 4
DeConv5_x: DeConv6 1024/256/3/2; Plus5 = DeConv6 + Conv3_3, –/768/–/–; Conv5 (Plus5) 768/256/3/1 | 16 × 8
DeConv4_x: DeConv5 512/128/3/2; Plus4 = DeConv5 + Conv2_3, –/384/–/–; Conv4 (Plus4) 384/128/3/1; Disp4 (Conv4) | 32 × 16
DeConv3_x: DeConv4 128/64/3/2; Plus3 = DeConv4 + Pool + Disp4, –/130/–/–; Conv3 (Plus3) 130/64/3/1; Disp3 (Conv3) | 64 × 32
DeConv2_x: DeConv3 96/32/3/2; Plus2 = DeConv3 + Conv1 + Disp3, –/98/–/–; Conv2 (Plus2) 98/32/3/1; Disp2 (Conv2) | 128 × 64
DeConv1_x: DeConv2 32/16/3/2; Plus1 = DeConv2 + Disp2, 16/18/–/–; Conv1 (Plus1) 18/16/3/1; Disp1 (Conv1) | 256 × 128

Conv, convolution; Pool, max pooling; Conv_x, convolution block; DeConv, deconvolution; DeConv_x, deconvolution block; Disp, disparity layer; In, input channels; Out, output channels; K, kernel size; S, stride; Number, block number; Output size, output image size; Plus, skip connection. Each deconvolution upsamples by a factor of 2.

2.2. Binocular depth estimation loss function

The loss function minimized to train the unsupervised binocular depth estimation network comprises three parts. The first part is the left–right consistency loss C_LR, the L1 error between the predicted left disparity D_L and right disparity D_R, where (i, j) is the pixel index of the image (the right counterpart is C_LRR):

C_LR = Σ_{i,j} | D_L(i, j) − D_R(i + D_L(i, j), j) |    (1)

The second part is the structural similarity loss C_SSIM (where SSIM is the structural similarity index), the error between the input image and the reconstructed image, with N the number of pixels (the right counterpart is C_SSIMR):

C_SSIM = (1 / 2N) Σ_{i,j} | 1 − SSIM(I_L(i, j), Ĩ_L(i, j)) |    (2)

The third part is the reconstruction error loss between the input image I_L(i, j) and the reconstructed image Ĩ_L(i, j) (the right counterpart is C_RECR):

C_REC = Σ_{i,j} | I_L(i, j) − Ĩ_L(i, j) |    (3)

The loss is evaluated at four output scales, with a scale factor of 2 between successive scales. The total loss function is

C = Σ_{s=1}^{4} [ α (C_LR + C_LRR) + β (C_REC + C_RECR) + λ (C_SSIM + C_SSIMR) ]

with α = β = λ = 1.
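A minimal sketch of how the three loss terms and the four-scale total of Section 2.2 can be assembled in TensorFlow 2.x is shown below. The helper names are hypothetical, tf.image.ssim stands in for the paper's SSIM term, and the warp function is assumed to be the bilinear sampler sketched above; only the weighting α = β = λ = 1 is taken from the paper.

```python
import tensorflow as tf

def l1_loss(a, b):
    """Mean absolute error, used for both the C_LR and C_REC terms."""
    return tf.reduce_mean(tf.abs(a - b))

def ssim_loss(img, img_rec):
    """Structural-dissimilarity term |1 - SSIM| / 2; the per-image
    tf.image.ssim is used here as an approximation of the per-pixel form."""
    ssim = tf.image.ssim(img, img_rec, max_val=1.0)
    return tf.reduce_mean(tf.abs(1.0 - ssim)) / 2.0

def scale_loss(i_l, i_r, d_l, d_r, warp, alpha=1.0, beta=1.0, lam=1.0):
    """Loss at one output scale: left-right consistency, reconstruction,
    and SSIM terms for both the left and the right image."""
    i_l_rec = warp(i_r, d_l)       # reconstruct left image from right
    i_r_rec = warp(i_l, -d_r)      # reconstruct right image from left (sign assumed)
    d_l_rec = warp(d_r, d_l)       # right disparity sampled at left positions
    d_r_rec = warp(d_l, -d_r)
    c_lr = l1_loss(d_l, d_l_rec) + l1_loss(d_r, d_r_rec)
    c_rec = l1_loss(i_l, i_l_rec) + l1_loss(i_r, i_r_rec)
    c_ssim = ssim_loss(i_l, i_l_rec) + ssim_loss(i_r, i_r_rec)
    return alpha * c_lr + beta * c_rec + lam * c_ssim

def total_loss(pyramid, warp):
    """Sum the per-scale losses over the four output scales (Disp1-Disp4).
    `pyramid` is a list of (i_l, i_r, d_l, d_r) tuples, one per scale."""
    return tf.add_n([scale_loss(*scale, warp=warp) for scale in pyramid])
```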
2.3. Training details

The unsupervised binocular depth estimation method was implemented with the TensorFlow framework on an Nvidia Tesla P100 GPU (16 GB). An exponential activation function was used after each convolution and deconvolution, except for the convolutions that output the disparity maps. The Adam optimizer was used. The network was trained for 50 epochs on the training dataset with an initial learning rate of 10^-4. The batch size was 16, and the total training time was about 8 h. The images were resized to 256 × 128 to reduce the computational time. The number of network parameters was about 9.5 × 10^7.

3. Results

The unsupervised binocular ResNet depth estimation method was compared with the Basic (unsupervised single convolutional neural network, CNN) and Siamese (unsupervised binocular CNN) methods of [9]; example results are illustrated in Figure 2. Higher intensity in the depth maps means a shorter distance to the camera.

Figure 2. Example results of the three methods. The left two columns are the input images; the third column is the Siamese result; the fourth column is the Basic result; and the last column is the result of this study. Green boxes indicate comparisons of the different results for the same tissue region.

No ground truth is available for this dataset. Therefore, the performance was compared with the published results, the best of which were taken as the reference for evaluation using the SSIM and the peak signal-to-noise ratio (PSNR). The evaluation values were averaged over the 7191 pairs of calibrated stereo images in the testing set; the results are given in Table 2. The time for generating a predicted depth image was about 16 ms.

Table 2. Comparison of evaluation results between the Basic method and the method used in this study.
Method | Basic | Present study
Mean SSIM | 0.5414 ± 0.0709 | 0.8349 ± 0.0523
Mean PSNR | 7.7650 ± 1.3686 | 14.4957 ± 1.9676
PSNR, peak signal-to-noise ratio; SSIM, structural similarity index.

The 3D reconstruction was performed on the left image using the corresponding disparity map and the intrinsic and extrinsic parameters of the left camera of the 3D laparoscope. In the reconstruction, an error appeared on the left side of the disparity map because of laparoscopic occlusion, as shown in Figure 3(b). The occluded part was cut away; the remaining part is shown in Figure 3(c) and its reconstruction in Figure 3(d).

Figure 3. An example of 3D reconstruction. (a) Left image. (b) Disparity map. (c) Post-processing. (d) 3D reconstruction.
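The reconstruction step above corresponds to standard pinhole stereo geometry. Below is a small Python sketch, assuming rectified images, a known baseline B, and the depth-from-disparity relation Z = f·B/d; the function name and argument names are hypothetical, and this is not the authors' exact post-processing pipeline.

```python
import numpy as np

def disparity_to_pointcloud(disp, left_img, K, baseline):
    """Back-project left-image pixels to 3D camera coordinates.

    disp:     H x W disparity map in pixels (invalid/occluded pixels <= 0)
    left_img: H x W x 3 RGB image providing the point colors
    K:        3 x 3 left-camera intrinsic matrix (fx, fy, cx, cy)
    baseline: stereo baseline, in the metric unit desired for the output
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disp > 0                      # drop occluded / invalid pixels
    z = fx * baseline / disp[valid]       # depth from disparity: Z = f * B / d
    x = (u[valid] - cx) * z / fx          # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)  # N x 3 point coordinates
    colors = left_img[valid]              # N x 3 colors taken from the left image
    return points, colors
```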
4. Discussion

The results of the present study are better than those of the Basic method and similar to those of the Siamese method (Figure 2 and Table 2). For example, the green boxes in Figure 2 mark a whole piece of prominent tissue whose right half is covered with blood; because the tissue lies at roughly the same distance from the camera, it should have the same brightness in the depth map, and the result correctly shows the depth of the covered part.

In the 3D reconstruction in Figure 3, only the pixels of the left image were mapped, together with their colors, to spatial 3D coordinates, which shows the correctness of the estimated depth values and the quality of the 3D reconstruction results.

5. Conclusions

In this study, a novel end-to-end depth prediction network was proposed for laparoscopic soft-tissue 3D reconstruction. A residual network was used for the first time in binocular depth estimation of the laparoscopic soft-tissue surface and produced better dense depth maps. Generating a depth map takes only about 16 ms, which can satisfy the real-time display requirements of real surgical scenes, since depth computation is the most time-consuming part of the 3D reconstruction. Future studies will train abdominal soft-tissue surface depth estimation networks with transfer learning and ensemble learning with fine-tuning, further improving robustness and accuracy.

Funding

This study was supported by the National Key Research and Development Program [Nos. 2016YFC0106500/2 and SQ2017ZY040217/03], the NSFC-Guangdong Union Grant [No. U1401254], the NSFC-Shenzhen Union Grant [No. U1613221], the Guangdong Scientific and Technology Program [No. 2015B020214005], the Shenzhen Key Basic Science Program [No. JCYJ20170413162213765], and the Shenzhen Key Laboratory Project under Grant ZDSYS201707271637577.
International Conference [14] He K, Zhang X, Ren S, et al. Deep residual learning for on Neural Information Processing Systems. image recognition. In: Bischof H et al. (Eds.) IEEE 60(2):1097–1105. 2012. Curran Associates, Inc., Red Conference on computer Vision and Pattern Hook, NY, USA. Recognition. 2015. p. 770–778. IEEE, Inc., Los [9] Ye M, Johns E, Handa A, et al. Self-supervised Alamitos, CA, USA. Siamese learning on stereo image pairs for depth esti- [15] Godard C, Aodha OM, Brostow GJ. Unsupervised mon- mation in robotic surgery. In: Yang G-Z (Eds.) ocular depth estimation with left-right consistency. In: Proceedings of the Hamlyn Symposium on Medical Chellappa R et al. (Eds.). IEEE Conference on Robotics. 2017. p. 27-28. Imperial College London and Computer Vision and Pattern Recognition. 2017. the Royal Geographical Society, London, UK. 2017. arXiv preprint arXiv:1705.08260. p. 6602–6611. IEEE, Inc, Los Alamitos, CA, USA.