Pseudo-3D Attention Transfer Network with Content-aware Strategy for Image Captioning


Publisher
Association for Computing Machinery
Copyright
Copyright © 2019 ACM
ISSN
1551-6857
eISSN
1551-6865
DOI
10.1145/3336495

Abstract

In this article, we propose a novel Pseudo-3D Attention Transfer network with Content-aware Strategy (P3DAT-CAS) for the image captioning task. Our model is composed of three parts: the Pseudo-3D Attention (P3DA) network, the P3DA-based Transfer (P3DAT) network, and the Content-aware Strategy (CAS). First, we propose P3DA to take full advantage of the three-dimensional (3D) information in convolutional feature maps and capture more details. Most existing attention-based models extract only a 2D spatial representation from convolutional feature maps to decide which areas deserve more attention. However, convolutional feature maps are 3D, and different channels can detect diverse semantic attributes associated with an image. P3DA combines 2D spatial maps with 1D semantic-channel attributes to generate more informative captions. Second, we design the transfer network to maintain and transfer key previous attention information. Traditional attention-based approaches use only the current attention information to predict words directly, whereas the transfer network can learn long-term attention dependencies and explore global modeling patterns. Finally, we present CAS to provide a more relevant and distinct caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that correlate only weakly with the image content, leaving a cross-modal gap between vision and language. CAS helps the model convey meaningful visual content accurately. P3DAT-CAS is evaluated on Flickr30k and MSCOCO, and it achieves very competitive performance among state-of-the-art models.
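
The abstract does not give the P3DA equations, so the following is only a minimal sketch of the general idea it describes: attending over channels (1D) and over spatial positions (2D) of one convolutional feature map. The scoring functions, the channel-then-spatial ordering, and all names (`pseudo_3d_attention`, `w_chan`, `w_spat`, `w_hid`) are assumptions made for illustration and should not be read as the authors' formulation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def pseudo_3d_attention(features, hidden, w_chan, w_spat, w_hid):
    """Combine 1D channel attention with 2D spatial attention (illustrative).

    features: C rows, each holding H*W activations (spatial grid flattened).
    hidden:   decoder state, length D.
    w_chan:   C x D hypothetical projection used for channel scores.
    w_spat:   length-C hypothetical projection used for spatial scores.
    w_hid:    length-D hypothetical projection of the decoder state.
    """
    num_positions = len(features[0])

    # 1D channel attention: score each channel from its mean activation
    # plus a term that depends on the decoder state (assumed scoring rule).
    chan_scores = [
        math.tanh(sum(row) / num_positions + dot(w_chan[c], hidden))
        for c, row in enumerate(features)
    ]
    beta = softmax(chan_scores)
    reweighted = [[beta[c] * v for v in row] for c, row in enumerate(features)]

    # 2D spatial attention over the channel-reweighted map.
    hid_term = dot(w_hid, hidden)
    spat_scores = [
        math.tanh(dot(w_spat, [row[p] for row in reweighted]) + hid_term)
        for p in range(num_positions)
    ]
    alpha = softmax(spat_scores)

    # Context vector: attention-weighted sum over positions, one value per channel.
    context = [dot(alpha, row) for row in reweighted]
    return context, beta, alpha


# Toy usage with random inputs: 4 channels, a 3x3 grid, decoder state of size 5.
rng = random.Random(0)
C, H, W, D = 4, 3, 3, 5
features = [[rng.random() for _ in range(H * W)] for _ in range(C)]
hidden = [rng.random() for _ in range(D)]
w_chan = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
w_spat = [rng.uniform(-1, 1) for _ in range(C)]
w_hid = [rng.uniform(-1, 1) for _ in range(D)]
context, beta, alpha = pseudo_3d_attention(features, hidden, w_chan, w_spat, w_hid)
print(len(context), len(beta), len(alpha))
```

In this sketch `beta` weights the channels and `alpha` weights the grid positions, so the resulting context vector depends on all three dimensions of the feature map rather than on the spatial layout alone. A real captioning model would learn the projections and feed the context vector to the word decoder at every step.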

Journal

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Association for Computing Machinery

Published: Aug 8, 2019

Keywords: Image captioning
