Link of the Paper: https://arxiv.org/pdf/1412.6632.pdf
Main Points:
- A multimodal Recurrent Neural Network ( a deep CNN for images + a multimodal layer + an RNN for sentences )
- The m-RNN model is trained with a log-likelihood cost function. The errors can be backpropagated through all three parts of the m-RNN model to update the model parameters simultaneously.
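The log-likelihood cost above amounts to maximizing the probability the model assigns to each ground-truth word of a caption, i.e. minimizing the average negative log-probability over the sentence. A minimal sketch of that cost (the function names and the toy probability values are illustrative, not from the paper):

```python
import math

def sentence_log_likelihood_cost(word_probs, target_ids):
    """Average negative log-likelihood of the ground-truth words.

    word_probs: one probability distribution over the vocabulary per
                time step (e.g. the softmax output after the multimodal layer).
    target_ids: index of the ground-truth word at each time step.
    """
    assert len(word_probs) == len(target_ids)
    nll = 0.0
    for probs, target in zip(word_probs, target_ids):
        # Penalize the model by -log P(correct word); a confident correct
        # prediction contributes little, a wrong one contributes a lot.
        nll -= math.log(probs[target])
    return nll / len(target_ids)

# Toy example: a 3-word vocabulary, a 2-word caption.
probs = [[0.7, 0.2, 0.1],   # step 1: model favors word 0
         [0.1, 0.8, 0.1]]   # step 2: model favors word 1
cost = sentence_log_likelihood_cost(probs, target_ids=[0, 1])
```

Training then lowers this cost by gradient descent, with the gradients flowing back into the CNN, the multimodal layer, and the RNN at once.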
Other Key Points:
- Applications of Image Captioning: early childhood education, image retrieval, and navigation for the blind.
- Many previous methods treat image captioning as a retrieval task and formulate it as a ranking or embedding-learning problem. They first extract word and sentence features ( e.g. Socher et al. (2014) use a dependency-tree Recursive Neural Network to extract sentence features ) as well as image features. They then optimize a ranking cost to learn an embedding model that maps both the sentence features and the image features into a common semantic space, so the distance between images and sentences can be computed directly. These methods generate image captions by retrieving them from a sentence database, and thus cannot generate novel sentences or describe images that contain novel combinations of objects and scenes.
- Benchmark datasets for Image Captioning: IAPR TC-12 ( Grubinger et al.(2006) ), Flickr 8K ( Rashtchian et al.(2010) ), Flickr 30K ( Young et al.(2014) ) and MS COCO ( Lin et al.(2014) ).
- Tasks related to Image Captioning: Generating Novel Sentences, Retrieving Images Given a Sentence, Retrieving Sentences Given an Image.
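The retrieval-based formulation described above (sentences and images embedded in a common semantic space, then ranked by distance) can be sketched as follows; the cosine-similarity scoring and the toy embeddings are illustrative assumptions, not the specific ranking cost any cited method uses:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_sentences(image_emb, sentence_embs, k=1):
    """Return the indices of the k sentences closest to the image
    in the shared embedding space (higher similarity = closer)."""
    ranked = sorted(range(len(sentence_embs)),
                    key=lambda i: cosine(image_emb, sentence_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D embeddings already mapped into the common space.
image_emb = [1.0, 0.0]
sentence_embs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
best = retrieve_sentences(image_emb, sentence_embs, k=1)
```

The same ranking run in the opposite direction (fix a sentence embedding, rank image embeddings) covers "Retrieving Images Given a Sentence"; the limitation noted above is that the output is always drawn from the existing sentence database.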