In this article, we propose a novel Pseudo-3D Attention Transfer network with Content-aware Strategy (P3DAT-CAS) for the image captioning task. Our model is composed of three parts: the Pseudo-3D Attention (P3DA) network, the P3DA-based Transfer (P3DAT) network, and the Content-aware Strategy (CAS). First, we propose P3DA to take full advantage of the three-dimensional (3D) information in convolutional feature maps and capture finer details. Most existing attention-based models extract only a 2D spatial representation from convolutional feature maps to decide which area deserves more attention. However, convolutional feature maps are 3D, and different channel features can detect diverse semantic attributes associated with images. P3DA therefore combines 2D spatial maps with 1D semantic-channel attributes to generate more informative captions. Second, we design the transfer network to maintain and transfer key attention information from previous time steps. Traditional attention-based approaches use only the current attention information to predict words directly, whereas the transfer network learns long-term attention dependencies and explores a global modeling pattern. Finally, we present CAS to provide a more relevant and distinct caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that correlate only weakly with image content, resulting in a cross-modal gap between vision and linguistics; CAS helps convey the meaningful visual content accurately. P3DAT-CAS is evaluated on Flickr30k and MSCOCO, and it achieves very competitive performance among state-of-the-art models.
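The core idea behind P3DA, as summarized above, is to weight a convolutional feature map along both its spatial (H×W) axis and its channel (C) axis before pooling it for the decoder. The toy NumPy sketch below illustrates that combination only; it is not the paper's implementation. The projection vectors `w_s` and `w_c`, the linear scoring, and the multiplicative fusion of the two attention maps are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_3d_attention(feat, w_s, w_c):
    """Combine 2D spatial attention with 1D channel attention over a
    CNN feature map of shape (C, H, W).

    w_s: (C,) hypothetical projection scoring spatial locations.
    w_c: (H*W,) hypothetical projection scoring channels.
    Returns the attended feature vector (C,) plus both weight maps.
    """
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)       # (C, HW)

    # 2D spatial attention: one score per location, softmax over H*W.
    alpha = softmax(w_s @ flat)         # (HW,), sums to 1

    # 1D semantic-channel attention: one score per channel, softmax over C.
    beta = softmax(flat @ w_c)          # (C,), sums to 1

    # Pseudo-3D: each (channel, location) cell is weighted by both factors,
    # then pooled over locations to give one attended vector per channel.
    attended = (beta[:, None] * flat * alpha[None, :]).sum(axis=1)  # (C,)
    return attended, alpha, beta

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))   # toy feature map: C=8, H=W=4
w_s = rng.standard_normal(8)
w_c = rng.standard_normal(16)
v, alpha, beta = pseudo_3d_attention(feat, w_s, w_c)
print(v.shape)
```

In a real captioning model, `w_s` and `w_c` would be learned layers conditioned on the decoder's hidden state, and the attended vector would feed the LSTM at each time step; the sketch keeps them as fixed random projections to isolate the spatial-plus-channel weighting idea.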
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) – Association for Computing Machinery
Published: Aug 8, 2019
Keywords: Image captioning