
Jointly Learning to See, Ask, and Guess


Abstract

In our daily use of natural language, we constantly profit from our strong reasoning skills to interpret the utterances we hear or read. At times we exploit implicit associations we have learned between words or between events; at other times we explicitly think through a problem and follow the reasoning steps carefully and slowly. We could say that the latter are the realm of logical approaches based on symbolic representations, whereas the former are better modelled by statistical models, like Neural Networks (NNs), based on continuous representations. My talk will focus on how NNs can learn to engage in a conversation about visual content. Specifically, I will present our work on Visual Dialogue (VD), taking as examples two task-oriented VD games, GuessWhat?! [2] and GuessWhich [1]. In these tasks, two NN agents interact with each other so that one of them (the Questioner), by asking questions of the other (the Answerer), can guess which object the Answerer has in mind among all the entities in a given image (GuessWhat?!), or which image the Answerer sees among several shown to the Questioner at the end of the dialogue (GuessWhich). I will present our Questioner model: it encodes both visual and textual inputs, produces a multimodal representation, generates natural language questions, understands the Answerer's responses, and guesses the object/image. I will show how training the NN agent's modules (Question generator and Guesser) jointly and cooperatively improves model performance and increases the quality of the dialogues. In particular, I will compare our model's dialogues with those of VD models that exploit much more complex learning paradigms, like Reinforcement Learning, showing that more complex machine learning methods do not necessarily correspond to better dialogue quality or even better quantitative performance. The talk is based on [3] and other work available at https://vista-unitn-uva.github.io/.
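The abstract describes the Questioner as one network with a shared multimodal encoder feeding two heads, a question generator and a guesser, trained jointly on a single objective. As a rough illustration of that architecture only (this is not the authors' code from [3]; the module names, layer sizes, and the equal weighting of the two losses are assumptions made for this sketch), a minimal PyTorch version might look like this:

    import torch
    import torch.nn as nn

    class Questioner(nn.Module):
        def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512,
                     visual_dim=2048, num_candidates=20):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Dialogue-history encoder (the questions and answers so far).
            self.text_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            # Project image features into the same space and fuse the two modalities.
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
            # Question-generator head: decodes the next question word by word.
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.word_out = nn.Linear(hidden_dim, vocab_size)
            # Guesser head: scores each candidate object (or image).
            self.guess_out = nn.Linear(hidden_dim, num_candidates)

        def encode(self, image_feats, history_tokens):
            _, (h, _) = self.text_encoder(self.embed(history_tokens))
            fused = torch.tanh(self.fuse(
                torch.cat([h[-1], self.visual_proj(image_feats)], dim=-1)))
            return fused  # multimodal representation shared by both heads

        def forward(self, image_feats, history_tokens, question_tokens):
            state = self.encode(image_feats, history_tokens)
            # Decode the question conditioned on the multimodal state.
            dec_out, _ = self.decoder(
                self.embed(question_tokens),
                (state.unsqueeze(0), torch.zeros_like(state).unsqueeze(0)))
            word_logits = self.word_out(dec_out)   # (batch, time, vocab)
            guess_logits = self.guess_out(state)   # (batch, num_candidates)
            return word_logits, guess_logits

    # Joint, cooperative training: one optimizer over both heads, summing the
    # question-generation loss and the guessing loss (the 1:1 weighting here is
    # an assumption of this sketch, not a detail taken from the talk).
    def joint_loss(word_logits, target_words, guess_logits, target_obj):
        gen = nn.functional.cross_entropy(word_logits.transpose(1, 2), target_words)
        guess = nn.functional.cross_entropy(guess_logits, target_obj)
        return gen + guess

    # Example forward pass with dummy data:
    # img = torch.randn(4, 2048); hist = torch.randint(0, 5000, (4, 30))
    # q = torch.randint(0, 5000, (4, 12))
    # word_logits, guess_logits = Questioner()(img, hist, q)

The point of sharing the encoder is that both heads are optimized against the same multimodal representation, which is the "joint and cooperative" training the abstract contrasts with more complex paradigms such as Reinforcement Learning.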
