AnswerNet: Learning to Answer Questions
Document Type
Article
Date of Original Version
December 1, 2019
Abstract
Multi-modal tasks like visual question answering (VQA) are an important step towards human-level artificial intelligence. In general, the input of the VQA task consists of an image and a related question. To answer the question correctly, a model needs to extract and integrate useful information from both the image and the question. In this paper, we propose a model named AnswerNet to tackle this task. In the proposed model, discriminative features are extracted from both the image and the question. Specifically, high-level image features are extracted by a state-of-the-art convolutional neural network, the Deep Residual Network (ResNet). For question features, the semantic representation of the question and the term frequencies of its distinct words are captured by a long short-term memory (LSTM) network and a bag-of-words model, respectively. A hierarchical fusion network is then proposed to effectively fuse the image features with the question features. Experimental results on three large-scale datasets, VQA, COCO-QA, and VQA2, demonstrate the effectiveness of the proposed AnswerNet.
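The abstract outlines a three-branch pipeline: ResNet image features, an LSTM semantic question representation, a bag-of-words term-frequency vector, and a hierarchical fusion of the three. Below is a minimal PyTorch sketch of how such a pipeline could be wired together. The class name AnswerNetSketch, all layer sizes, the element-wise-product fusion operator, and the answer vocabulary size are illustrative assumptions, not the paper's published architecture.

```python
# A minimal sketch of an AnswerNet-style VQA pipeline, assuming element-wise
# multiplication as the fusion operator and illustrative layer sizes; the
# paper's actual hierarchical fusion network may differ.
import torch
import torch.nn as nn
import torchvision.models as models

class AnswerNetSketch(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 fused_dim=1024, num_answers=1000):
        super().__init__()
        # Image branch: a ResNet with its classifier head removed,
        # yielding a 2048-d global image feature.
        resnet = models.resnet152(weights=None)  # weights omitted to keep the sketch self-contained
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(2048, fused_dim)

        # Question branch 1: LSTM over word embeddings captures the
        # semantic representation of the question.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.lstm_proj = nn.Linear(hidden_dim, fused_dim)

        # Question branch 2: bag-of-words term-frequency vector.
        self.bow_proj = nn.Linear(vocab_size, fused_dim)

        # Answer classifier over the fused representation.
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, image, question_ids, bow):
        img = self.cnn(image).flatten(1)           # (B, 2048)
        img = torch.tanh(self.img_proj(img))       # (B, fused_dim)

        emb = self.embed(question_ids)             # (B, T, embed_dim)
        _, (h, _) = self.lstm(emb)
        sem = torch.tanh(self.lstm_proj(h[-1]))    # (B, fused_dim)

        tf = torch.tanh(self.bow_proj(bow))        # (B, fused_dim)

        # Hierarchical fusion (assumed form): first fuse the two question
        # representations, then fuse the result with the image feature.
        q = sem * tf
        fused = q * img
        return self.classifier(fused)              # answer scores
```

Treating answer prediction as classification over a fixed answer vocabulary, as the classifier head above does, is the standard setup on VQA, COCO-QA, and VQA2; the two-stage element-wise fusion here is only one plausible reading of "hierarchical fusion."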
Publication Title
IEEE Transactions on Big Data
Volume
5
Issue
4
Citation/Publisher Attribution
Wan, Zhiqiang, and Haibo He. "AnswerNet: Learning to Answer Questions." IEEE Transactions on Big Data 5, no. 4 (2019). doi: 10.1109/TBDATA.2018.2884486.