时序不限：

1.Scene Graph Generation from Objects, Phrases and Region Captions

2018ICCV Yikang Li The Chinese University of Hong Kong

考虑的问题：1）物体检测，场景图生成和图像标题是不同语义层次的三个场景理解，用场景图（文本）描述检测到对象之间的视觉关系

解决：提出了一种新的神经网络模型，称为多级场景描述网络（表示为MSDN），以端到端的方式联合解决三个视觉任务该网络细化处理为五个部分，通过VGG16训练得出图像特征1.区域RPN,2. FC层专业化,3.基于ROI构造动态图, 4.对象精炼，5,生成场景图

2.Scene Graph Generation by Iterative Message Passing

CVPR2017Danfei Xu1 Yuke Zhu1 Christopher B. Choy2 Li Fei-Fei1
https://github.com/danfeiX/scene-graph-TF-release

解决的问题：1）图像语义理解需要搞清楚object之间的关联，有个直接的办法就是生成关系图来建模relationship。如何生成scene graph，是从图像构建，还是从图像与语句对来构建？本文采用的是完全从图像中构建。

解决：关于这里所说的关系往往指主-谓-宾的结构，预测关系就是可以理解为给定主语和宾语时，能够准确预测谓语。

网络构成：

1.fast-RCNN检测得到object

2.采用CRF（条件随机场）来建模，推断节点和边的关系

3.使用两个GRU来表示所有node和edge

3.将graph的推断迭代问题建模为RNN方式

4.利用message passing提高精度。

5.CRF利用平均场方法来解，(可以认为图是由顶点和边构成二分结构，利用平均场，所有对某个定点的影响是由其所连接的所有边造成的，反过来对于边也可解释

3.Visual Relationship Detection with Language Priors
https://github.com/Prof-Lu-Cewu/Visual-Relationship-Detection

在视觉模型之外引入了语言模型，将关系映射到一个embedding space(embedding我的理解就是抽象为向量)，通过学习关系先验使得相近的关系向量距离很近，输出可能性评分。

4.Pixels to Graphs by Associative Embedding
https://github.com/princeton-vl/px2graph

引入Human Pose Estimation中的associative embedding。构造一个像素级网络输出object和relationship的热点图并导出像素级features，将带有id的features导入全连接层预测object和relationship。

Im2Text：使用100万张标题照片描述图像

论文：http：//tamaraberg.com/papers/generation_nips2011.pdf
项目：http：//vision.cs.stonybrook.edu/~vicente/sbucaptions/

用于视觉识别和描述的长期循环卷积网络

简介：2015年CVPR的口头报告.LRCN
项目页面：http：//jeffdonahue.com/lrcn/
arxiv：http：//arxiv.org/abs/1411.4389
github：https：//github.com/BVLC/caffe/pull/2033

展示并演讲

显示和说明：神经图像标题生成器

介绍：谷歌
arxiv：http：//arxiv.org/abs/1411.4555
github：https：//github.com/karpathy/neuraltalk
gitxiv：http：//gitxiv.com/posts/7nofxjoYBXga5XjtL/show-and-tell-a-neural-image-caption-nic-generator
github：https：//github.com/apple2373/chainer_caption_generation
github（TensorFlow）：https：//github.com/tensorflow/models/tree/master/im2txt
github（TensorFlow）：https：//github.com/zsdonghao/Image-Captioning

CNN和LSTM生成图像标题

博客：http：//t-satoshi.blogspot.com/2015/12/image-caption-generation-by-cnn-and-lstm.html
github：https：//github.com/jazzsaxmafia/show_and_tell.tensorflow

展示和讲述：从2015年MSCOCO图像字幕挑战中吸取的教训

arxiv：http：//arxiv.org/abs/1609.06647
github：https：//github.com/tensorflow/models/tree/master/im2txt

学习图像标题生成的周期性视觉表示

arxiv：http：//arxiv.org/abs/1411.5654

Mind's Eye：图像标题生成的周期性视觉表示

简介：CVPR 2015
论文：http：//www.cs.cmu.edu/~xinleic/papers/cvpr15_rnn.pdf

用于生成图像描述的深层视觉语义对齐

介绍：“提出一个多模态深度网络，使用CNN特征表示图像的各个有趣区域与相关词汇对齐。然后，所学习的对应关系用于训练双向RNN。该模型不仅能够生成图像描述，还能够将句子的不同部分定位到相应的图像区域。“
项目页面：http：//cs.stanford.edu/people/karpathy/deepimagesent/
arxiv：http：//arxiv.org/abs/1412.2306
幻灯片：http：//www.cs.toronto.edu/~vendrov/DeepVisualSemanticAlignments_Class_Presentation.pdf
github：https：//github.com/karpathy/neuraltalk
演示：http：//cs.stanford.edu/people/karpathy/deepimagesent/rankingdemo/

多模态回归神经网络的深字幕

介绍：m-RNN。ICLR 2015
简介：“在RNN的嵌入和复发层之后，通过引入新的多模式层，结合了CNN和RNN的功能。”
主页：http：//www.stat.ucla.edu/~junhua.mao/m-RNN.html
arxiv：http：//arxiv.org/abs/1412.6632
github：https：//github.com/mjhucla/mRNN-CR
github：https：//github.com/mjhucla/TF-mRNN

显示，出席和讲述

显示，参与和讲述：视觉注意的神经图像标题生成（ICML 2015）

项目页面：http：//kelvinxu.github.io/projects/capgen.html
arxiv：http：//arxiv.org/abs/1502.03044
github：https：//github.com/kelvinxu/arctic-captions
github：https：//github.com/jazzsaxmafia/show_attend_and_tell.tensorflow
github（TensorFlow）：https：//github.com/yunjey/show-attend-and-tell-tensorflow
演示：http：//www.cs.toronto.edu/~rkiros/abstract_captions.html

自动描述历史照片

网站：https：//staff.fnwi.uva.nl/d.elliott/loc/

像孩子一样学习：从图像句子描述学习快速小说视觉概念

arxiv：http：//arxiv.org/abs/1504.06692
主页：http：//www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
github：https：//github.com/mjhucla/NVC-Dataset

明确的高级概念在视觉上对语言问题有什么价值？

arxiv：http：//arxiv.org/abs/1506.01144

调整在哪里看和说什么：图像标题与基于区域的注意和场景分解

arxiv：http：//arxiv.org/abs/1506.06272

使用CNN过滤器学习FRAME模型以进行知识可视化（CVPR 2015）

项目页面：http：//www.stat.ucla.edu/~yang.lu/project/deepFrame/main.html
arxiv：http：//arxiv.org/abs/1509.08379
代码+数据：http：//www.stat.ucla.edu/~yang.lu/project/deepFrame/doc/deepFRAME_1.1.zip

从注意的字幕生成图像

arxiv：http：//arxiv.org/abs/1511.02793
github：https：//github.com/emansim/text2image
演示：http：//www.cs.toronto.edu/~emansim/cap2im.html

图像和语言的顺序嵌入

arxiv：http：//arxiv.org/abs/1511.06361
github：https：//github.com/ivendrov/order-embedding

DenseCap：用于密集字幕的完全卷积定位网络

项目页面：http：//cs.stanford.edu/people/karpathy/densecap/
arxiv：http：//arxiv.org/abs/1511.07571
github（Torch）：https：//github.com/jcjohnson/densecap

用一系列自然句子表达一个图像流

简介：NIPS 2015. CRCN
nips-page：http：//papers.nips.cc/paper/5776-expressing-an-image-stream-with-a-sequence-of-natural-sentences
论文：http：//papers.nips.cc/paper/5776-expressing-an-image-stream-with-a-sequence-of-natural-sentences.pdf
论文：http：//www.cs.cmu.edu/~gunhee/publish/nips15_stream2text.pdf
作者页面：http：//www.cs.cmu.edu/~gunhee/
github：https：//github.com/cesc-park/CRCN

图像标题翻译的多模态支点

简介：ACL 2016
arxiv：http：//arxiv.org/abs/1601.03916

使用深度双向LSTM的图像字幕

简介：ACMMM 2016
arxiv：http：//arxiv.org/abs/1604.00790
github（Caffe）：https：//github.com/deepsemantic/image_captioning
演示：https：//youtu.be/a0bh9_2LE24

编码，检查和解码：用于生成标题的审阅者模块

查看网络的标题生成

简介：NIPS 2016
arxiv：https：//arxiv.org/abs/1605.07912
github：https：//github.com/kimiyoung/review_net

神经图像标题中的注意正确性

arxiv：http：//arxiv.org/abs/1605.09553

具有文本条件语义注意的图像标题生成

arxiv：https：//arxiv.org/abs/1606.04621
github：https：//github.com/LuoweiZhou/e2e-gLSTM-sc

DeepDiary：终身图像流的自动字幕生成

简介：ECCV国际自我中心感知，互动和计算研讨会
arxiv：http：//arxiv.org/abs/1608.03819

phi-LSTM：基于短语的图像字幕分层LSTM模型

简介：ACCV 2016
arxiv：http：//arxiv.org/abs/1608.05813

用不同的对象标题图像

arxiv：http：//arxiv.org/abs/1606.07770

学习在图像理解中推广到新的作品

arxiv：http：//arxiv.org/abs/1608.07639

生成字幕而不超出对象

简介：ECCV2016第二届图像和视频讲故事研讨会（VisStory）
arxiv：https：//arxiv.org/abs/1610.03708

SPICE：语义命题图像标题评估

简介：ECCV 2016
项目页面：http：//www.panderson.me/spice/
论文：http：//www.panderson.me/images/SPICE.pdf
github：https：//github.com/peteanderson80/SPICE

使用属性提升图像标题

arxiv：https：//arxiv.org/abs/1611.01646

Bootstrap，Review，Decode：使用域外文本数据来改进图像字幕

arxiv：https：//arxiv.org/abs/1611.05321

一种生成描述性图像段落的分层方法

简介：斯坦福大学
arxiv：https：//arxiv.org/abs/1611.06607

联合推理和视觉语境的密集字幕

简介：Snap Inc.
arxiv：https：//arxiv.org/abs/1611.06949

使用策略渐变方法优化图像描述指标

简介：牛津大学和谷歌
arxiv：https：//arxiv.org/abs/1612.00370

图像标题的注意区域

arxiv：https：//arxiv.org/abs/1612.01033

知道何时看：通过Visual Sentinel进行图像捕获的自适应注意

简介：CVPR 2017
arxiv：https：//arxiv.org/abs/1612.01887
github：https：//github.com/jiasenlu/AdaptiveAttention

循环图像捕获器：使用空间不变变换和注意过滤描述图像

arxiv：https：//arxiv.org/abs/1612.04949

具有语言CNN的经常性公路网络用于图像捕获

arxiv：https：//arxiv.org/abs/1612.07086

由字幕引导的自上而下的视觉显着性

arxiv：https：//arxiv.org/abs/1612.07360
github：https：//github.com/VisionLearningGroup/caption-guided-saliency

MAT：用于图像捕获的多模态注意转换器

https://arxiv.org/abs/1702.05658

基于深度强化学习的嵌入奖励图像标题

简介：Snap Inc＆Google Inc
arxiv：https：//arxiv.org/abs/1704.03899

参加你：使用上下文序列存储网络的个性化图像字幕

简介：CVPR 2017
arxiv：https：//arxiv.org/abs/1704.06485
github：https：//github.com/cesc-park/attend2u

Punny Captions：图像描述中的Witty Wordplay

https://arxiv.org/abs/1704.08224

展示，改编和讲述：跨领域图像捕获者的对抗性训练

https://arxiv.org/abs/1705.00930

图像标题的演员 - 评论家序列训练

简介：伦敦大学玛丽皇后学院和杨氏会计咨询有限公司
关键词：演员评论强化学习
arxiv：https：//arxiv.org/abs/1706.09601

回归神经网络（RNN）在图像标题生成器中的作用是什么？

简介：第十届自然语言世界国际会议论文集（INLG'17）
arxiv：https：//arxiv.org/abs/1708.02043

Stack-Captioning：用于图像字幕的粗到细学习

https://arxiv.org/abs/1709.03376

自导多模LSTM - 当我们没有完美的图像字幕训练数据集时

https://arxiv.org/abs/1709.05038

图像标题的对比学习

简介：NIPS 2017
arxiv：https：//arxiv.org/abs/1710.02534

基于分层LSTM模型的基于短语的图像标题

简介：ACCV2016扩展，基于短语的图像字幕
arxiv：https：//arxiv.org/abs/1711.05557

卷积图像标题

https://arxiv.org/abs/1711.09151

显示和傻瓜：制作神经图像字幕的对抗性示例

https://arxiv.org/abs/1712.02051

利用对偶语义对齐改进图像标题

简介：IBM Research
arxiv：https：//arxiv.org/abs/1805.00063

对象计数！将显式检测带回图像标题

简介：NAACL 2018
arxiv：https：//arxiv.org/abs/1805.00314

去除挫败的图像标题

简介：NAACL 2018
arxiv：https：//arxiv.org/abs/1805.06549

SemStyle：学习使用未对齐文本生成程式化图像标题

简介：CVPR 2018
arxiv：https：//arxiv.org/abs/1805.07030

用条件生成对抗网改进图像标题

https://arxiv.org/abs/1805.07112

CNN + CNN：用于图像捕获的卷积解码器

https://arxiv.org/abs/1805.09019

具有词性引导的多样且可控的图像标题

https://arxiv.org/abs/1805.12589

学习评估图像标题

简介：CVPR 2018
arxiv：https：//arxiv.org/abs/1806.06422

主题引导注意图像标题

简介：ICIP 2018
arxiv：https：//arxiv.org/abs/1807.03514

用于序列级图像捕获的上下文感知可视策略网络

简介：ACM MM 2018口服
arxiv：https：//arxiv.org/abs/1808.05864
github：https：//github.com/daqingliu/CAVP

探索图像标题的视觉关系

简介：ECCV 2018
arxiv：https：//arxiv.org/abs/1809.07041

图像字幕作为SOCKEYE中的神经机器翻译任务

https://arxiv.org/abs/1810.04101

无监督的图像标题

https://arxiv.org/abs/1811.10787

对象描述

无歧义对象描述的生成与理解

arxiv：https：//arxiv.org/abs/1511.02283
github：https：//github.com/mjhucla/Google_Refexp_toolbox

视频字幕/说明

在统一框架中联合建模深度视频和合成文本以桥接视觉和语言

简介：AAAI 2015
论文：http：//www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Pan_Jointly_Modeling_Embedding_CVPR_2016_paper.pdf
论文：http：//web.eecs.umich.edu/~jjcorso/pubs/xu_corso_AAAI2015_v2t.pdf

使用深度递归神经网络将视频转换为自然语言

介绍：NAACL-HLT 2015相机准备就绪
项目页面：https：//www.cs.utexas.edu/~vsub/naacl15_project.html
arxiv：http：//arxiv.org/abs/1412.4729
幻灯片：https：//www.cs.utexas.edu/~vsub/pdf/Translating_Videos_slides.pdf
代码+数据：https：//www.cs.utexas.edu/~vsub/naacl15_project.html#code

利用时间结构描述视频

arxiv：http：//arxiv.org/abs/1502.08029
github：https：//github.com/yaoli/arctic-capgen-vid

SA-tensorflow：用于生成视频字幕的软注意机制

github：https：//github.com/tsenghungchen/SA-tensorflow

顺序到序列 - 视频到文本

简介：ICCV 2015. S2VT
项目页面：http：//vsubhashini.github.io/s2vt.html
arxiv：http：//arxiv.org/abs/1505.00487
幻灯片：https：//www.cs.utexas.edu/~vsub/pdf/S2VT_slides.pdf
github（Caffe）：https：//github.com/vsubhashini/caffe/tree/recurrent/examples/s2vt
github（TensorFlow）：https：//github.com/jazzsaxmafia/video_to_sequence

嵌入式翻译与桥梁视频与语言的联合建模

arxiv：http：//arxiv.org/abs/1505.01861

使用双向递归神经网络的视频描述

arxiv：http：//arxiv.org/abs/1604.03390

视频描述的双向长短期记忆

arxiv：https：//arxiv.org/abs/1606.04631

使用人工智能自动翻译和标题视频的3种方法

博客：http：//photography.tutsplus.com/tutorials/3-ways-to-subtitle-and-caption-your-videos-automatically-using-artificial-intelligence-cms-26834

用于视频字幕生成的帧级和段级功能以及候选池评估

arxiv：http：//arxiv.org/abs/1608.04959

图像和视频的自然语言描述的接地和生成

简介：Anna Rohrbach。艾伦人工智能研究所（AI2）
youtube：https：//www.youtube.com/watch？v = fE3FX8FowiU

具有语义注意的视频字幕和检索模型

简介：LSMDC 2016挑战赛的四项任务中的三项（填空，多项选择测试和电影检索）获胜者（ECCV 2016研讨会）
arxiv：https：//arxiv.org/abs/1610.02947

接地视频字幕的时空注意模型

arxiv：https：//arxiv.org/abs/1610.04997

视频和语言：通过深度学习桥接视频和语言

介绍：ECCV-MM 2016.字幕，评论，对齐
幻灯片：https：//www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Video-and-Language-ECCV-MM-2016-Tao-Mei-Pub.pdf

用于描述视频的循环内存寻址

arxiv：https：//arxiv.org/abs/1611.06492

具有传输语义属性的视频字幕

arxiv：https：//arxiv.org/abs/1611.07675

用于将视频转换为语言的自适应特征提取

arxiv：https：//arxiv.org/abs/1611.07837

视觉字幕的语义组合网络

简介：CVPR 2017.杜克大学和清华大学及MSR
arxiv：https：//arxiv.org/abs/1611.08002
github：https：//github.com/zhegan27/SCN_for_video_captioning

用于视频字幕的分层边界感知神经编码器

arxiv：https：//arxiv.org/abs/1611.09312

基于注意力的多模融合视频描述

arxiv：https：//arxiv.org/abs/1701.03126

弱监督密集视频字幕

简介：CVPR 2017
arxiv：https：//arxiv.org/abs/1704.01502

用基础和共同参考的人生成描述

简介：CVPR 2017.电影描述
arxiv：https：//arxiv.org/abs/1704.01518

具有视频和蕴涵生成的多任务视频字幕

简介：ACL 2017. UNC教堂山
arxiv：https：//arxiv.org/abs/1704.07489

视频中的密集字幕事件

项目页面：http：//cs.stanford.edu/people/ranjaykrishna/densevid/
arxiv：https：//arxiv.org/abs/1705.00754

具有视频字幕调整时间注意的分层LSTM

https://arxiv.org/abs/1706.01231

强化视频字幕与蕴涵奖励

简介：EMNLP 2017. UNC教堂山
arxiv：https：//arxiv.org/abs/1708.02300

用于视频字幕，检索和问题回答的端到端概念词检测

简介：CVPR 2017.在LSMDC 2016挑战赛的四项任务中，获得三项（填空，多项选择测试和电影检索）
arxiv：https：//arxiv.org/abs/1610.02947
幻灯片：https：//drive.google.com/file/d/0B9nOObAFqKC9aHl2VWJVNFp1bFk/view

从确定性到生成性：用于视频字幕的多模态随机RNN

https://arxiv.org/abs/1708.02478

用于视频字幕的接地对象和交互

https://arxiv.org/abs/1711.06354

集成视觉和音频提示以增强视频标题

https://arxiv.org/abs/1711.08097

通过分层强化学习的视频字幕

https://arxiv.org/abs/1711.11135

基于共识的视频字幕序列训练

https://arxiv.org/abs/1712.09532

少即是多：为视频字幕选择信息框架

https://arxiv.org/abs/1803.01457

多任务强化学习的端到端视频字幕

https://arxiv.org/abs/1803.07950

具有屏蔽变压器的端到端密集视频字幕

简介：CVPR 2018.密歇根大学和Salesforce Research
arxiv：https：//arxiv.org/abs/1804.00819

视频字幕重建网络

简介：CVPR 2018
arxiv：https：//arxiv.org/abs/1803.11438

用于密集视频字幕的双向注入融合与上下文门控

简介：CVPR 2018聚光灯纸
arxiv：https：//arxiv.org/abs/1804.00100

联合本地化和描述密集视频字幕的事件

简介：CVPR 2018 Spotlight，2017年ActivityNet Captions Challenge排名第1
arxiv：https：//arxiv.org/abs/1804.08274

语境化，显示和讲述：神经视觉讲故事者

https://arxiv.org/abs/1806.00738

RUC + CMU：视频中密集字幕事件的系统报告

简介：ActivityNet 2018密集视频字幕挑战中的获胜者
arxiv：https：//arxiv.org/abs/1806.08854

项目

学习用于图像标题生成的CNN-LSTM架构：CNN-LSTM图像标题生成器架构的实现，其在MSCOCO数据集上实现接近最先进的结果。

github：https：//github.com/mosessoh/CNN-LSTM-Caption-Generator

screengrab-caption：一个openframeworks应用程序，用神经网络为你的桌面屏幕加上字幕

介绍：openframeworks应用程序，它抓取您的桌面屏幕，然后将其发送到暗网以进行字幕。适用于视频通话。
github：https：//github.com/genekogan/screengrab-caption

工具

CaptionBot（微软）

网站：https：//www.captionbot.ai/

SceneGraph以及ImageCaption文章项目整理

展示并演讲

显示，出席和讲述

对象描述

视频字幕/说明

项目

工具

猜你喜欢