In our daily use of natural language, we constantly profit from our strong reasoning skills to interpret utterances we hear or read. At times we exploit implicit associations we have learned between words or between events; at other times we explicitly think about a problem and follow the reasoning steps carefully and slowly. We could say that the latter are the realm of logical approaches based on symbolic representations, whereas the former are better modelled by statistical models, like Neural Networks (NNs), based on continuous representations. My talk will focus on how NNs can learn to be engaged in a conversation about visual content. Specifically, I will present our work on Visual Dialogue (VD), taking as examples two task-oriented VD games, GuessWhat?! [2] and GuessWhich [1]. In these tasks, two NN agents interact with each other so that one of the two (the Questioner), by asking questions to the other (the Answerer), can guess which object the Answerer has in mind among all the entities in a given image (GuessWhat?!) or which image the Answerer sees among several shown to the Questioner at the end of the dialogue (GuessWhich). I will present our Questioner model: it encodes both visual and textual inputs, produces a multimodal representation, generates natural language questions, understands the Answerer's responses, and guesses the object/image. I will show how training the NN agent's modules (Question Generator and Guesser) jointly and cooperatively helps the model's performance and increases the quality of the dialogues. In particular, I will compare our model's dialogues with those of VD models that exploit much more complex learning paradigms, like Reinforcement Learning, showing that more complex machine learning methods do not necessarily correspond to better dialogue quality or even better quantitative performance. The talk is based on [3] and other work available at https://vista-unitn-uva.github.io/.
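To make the described modular structure concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a Questioner agent with a visual-and-dialogue encoder producing a multimodal representation, a question-generation head, and a guesser head that are trained jointly through a shared encoder. All module names, dimensions, and the fusion scheme are assumptions chosen purely for illustration.

```python
# Hypothetical sketch of a Questioner agent: a shared multimodal encoder feeding
# two heads (question generator and guesser), as described in the abstract.
import torch
import torch.nn as nn

class Questioner(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=256, img_dim=2048,
                 hid_dim=512, n_candidates=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.dialogue_enc = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # encodes the Q/A history
        self.img_proj = nn.Linear(img_dim, hid_dim)                      # projects image features
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)                      # multimodal representation
        self.qgen = nn.Linear(hid_dim, vocab_size)                       # question-generator head (simplified to next-token scores)
        self.guesser = nn.Linear(hid_dim, n_candidates)                  # scores candidate objects/images

    def forward(self, dialogue_tokens, image_feats):
        # dialogue_tokens: (batch, seq_len) token ids; image_feats: (batch, img_dim)
        _, (h, _) = self.dialogue_enc(self.embed(dialogue_tokens))
        state = torch.tanh(self.fuse(torch.cat([h[-1], self.img_proj(image_feats)], dim=-1)))
        return self.qgen(state), self.guesser(state)

# Joint, cooperative training would optimize a combined objective, e.g.
# loss = question_generation_loss + guesser_loss, backpropagated through the
# shared encoder so that both modules shape the multimodal representation.
```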