As a further investigation of the image captioning task, some works have extended vision-text datasets for specific subtasks, such as stylized caption generation. The corpora of such datasets usually consist of sentences with obvious sentiment-bearing words, while in some special cases the captions are classified according to image category. This leads to a latent problem: the generated sentences can be close in semantic meaning yet belong to different or even opposite categories. How to effectively exploit the image category label to enlarge the difference between captions is therefore a worthwhile issue. In this paper, we propose an image captioning network with a label control mechanism (LCNET). First, to improve caption difference, LCNET employs a semantic enhancement module that provides the decoder with global semantic vectors. Then, through the proposed label control LSTM, LCNET dynamically modulates caption generation according to the image category labels. Finally, the decoder integrates the spatial image features with the global semantic vectors to output the caption. On all standard evaluation metrics, our model outperforms the compared models. Caption analysis shows that our approach improves semantic representation. Compared with other label control mechanisms, our model boosts the caption difference according to the labels while remaining more consistent with the image content.
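To make the label control idea concrete, the following is a minimal PyTorch sketch of one decoding step in which a category-label embedding gates the LSTM hidden state; the abstract does not give the exact formulation, so the class and parameter names (LabelControlLSTM, label_gate, etc.) are illustrative assumptions, not the paper's method.

    # Hypothetical sketch of a label-controlled LSTM decoder step.
    # The label embedding produces a sigmoid gate that rescales the
    # hidden state, so generation is modulated by the image category.
    import torch
    import torch.nn as nn

    class LabelControlLSTM(nn.Module):
        def __init__(self, word_dim, hidden_dim, num_labels, label_dim=64):
            super().__init__()
            self.label_emb = nn.Embedding(num_labels, label_dim)
            # standard LSTM cell over [word embedding; global semantic vector]
            self.cell = nn.LSTMCell(word_dim + hidden_dim, hidden_dim)
            # gate that rescales the hidden state according to the label
            self.label_gate = nn.Linear(label_dim, hidden_dim)

        def forward(self, word, semantic, label, state):
            h, c = self.cell(torch.cat([word, semantic], dim=-1), state)
            g = torch.sigmoid(self.label_gate(self.label_emb(label)))
            return g * h, (h, c)  # label-modulated output, new state

    # usage: one decoding step with batch size 2
    lstm = LabelControlLSTM(word_dim=32, hidden_dim=128, num_labels=10)
    word = torch.randn(2, 32)       # current word embeddings
    semantic = torch.randn(2, 128)  # global semantic vectors from the encoder
    label = torch.tensor([3, 7])    # image category labels
    h0 = (torch.zeros(2, 128), torch.zeros(2, 128))
    out, state = lstm(word, semantic, label, h0)

Under this assumed design, two images with similar content but different category labels would yield differently gated hidden states, which is one plausible way a decoder could increase caption difference across categories.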
Shan HE Yuanyao LU Shengnan CHEN
The development of deep learning and neural networks has brought broad prospects to computer vision and natural language processing. The image captioning task combines cutting-edge methods from both fields: by building an end-to-end encoder-decoder model, description performance can be greatly improved. In this paper, a multi-branch deep convolutional neural network is used as the encoder to extract image features, and a recurrent neural network is used to generate descriptive text that matches the input image. We conducted experiments on the Flickr8k, Flickr30k, and MSCOCO datasets. Analysis of the results on the evaluation metrics shows that the proposed model can effectively perform image captioning, and that its performance is better than that of classic image captioning models such as the neural image annotation model.
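As a minimal PyTorch sketch of the end-to-end encoder-decoder pattern this abstract describes, the code below uses a small stand-in CNN encoder whose image feature initializes an LSTM decoder over word embeddings; the layer sizes and the simple encoder are illustrative assumptions, not the paper's multi-branch architecture.

    # Hypothetical encoder-decoder captioner: CNN image feature
    # initializes the LSTM state; the decoder emits word logits.
    import torch
    import torch.nn as nn

    class CaptionModel(nn.Module):
        def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden_dim=512):
            super().__init__()
            # stand-in encoder; the paper uses a multi-branch deep CNN
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.init_h = nn.Linear(feat_dim, hidden_dim)
            self.init_c = nn.Linear(feat_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            feat = self.encoder(images)          # (B, feat_dim)
            h0 = self.init_h(feat).unsqueeze(0)  # decoder state initialized
            c0 = self.init_c(feat).unsqueeze(0)  # from the image feature
            hidden, _ = self.decoder(self.embed(captions), (h0, c0))
            return self.out(hidden)              # word logits per time step

    model = CaptionModel(vocab_size=1000)
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 1000])

Training such a model typically minimizes cross-entropy between the per-step logits and the ground-truth caption shifted by one token, which is the standard setup for encoder-decoder captioning.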