Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering

Abstract: Visual Question Answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism that lets it attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model's superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and co-attention mechanism, to the model's effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model's dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
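The abstract names a co-attention unit and a multi-step reasoning module but gives no equations, so the following is only a minimal illustrative sketch of how such components are commonly built, not the authors' implementation. Every name here (`co_attention`, `multi_step_reasoning`, `steps`), the max-pooling over the affinity matrix, and the residual state update are assumptions made for illustration. Question tokens and image regions are assumed to be already embedded as equal-length vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors into one vector using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def co_attention(question_tokens, image_regions):
    """Hypothetical co-attention: weights over tokens and over regions.

    Builds a scaled dot-product affinity matrix, then scores each token by
    its best-matching region and each region by its best-matching token.
    """
    scale = math.sqrt(len(question_tokens[0]))
    affinity = [[dot(q, r) / scale for r in image_regions] for q in question_tokens]
    question_weights = softmax([max(row) for row in affinity])
    region_weights = softmax([max(col) for col in zip(*affinity)])
    return question_weights, region_weights


def multi_step_reasoning(question_tokens, image_regions, steps=3):
    """Hypothetical iterative inference: re-attend to the image `steps` times."""
    question_weights, _ = co_attention(question_tokens, image_regions)
    state = weighted_sum(question_weights, question_tokens)
    scale = math.sqrt(len(state))
    trace = []  # attention over regions at each step, for inspection
    for _ in range(steps):
        weights = softmax([dot(state, r) / scale for r in image_regions])
        attended = weighted_sum(weights, image_regions)
        state = [s + a for s, a in zip(state, attended)]  # assumed residual update
        trace.append(weights)
    return state, trace


# Toy inputs: two question tokens and three image regions, all 2-dimensional.
question = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 0.5], [0.1, 0.1]]
state, trace = multi_step_reasoning(question, regions, steps=2)
```

Inspecting `trace` step by step is one simple way to see how attention over regions shifts as inference proceeds, in the spirit of the qualitative attention analysis the abstract mentions.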

Authors: Rufai Yusuf Zakari, Jim Wilson Owusu, Ke Qin, Tao He, Guangchun Luo

Keywords: Machine learning; Deep learning; visual question answering (VQA); multi-step reasoning; computer vision

Online publication date: 2025-06-13 (date first published on the website)

Pages: 21 (458-478)

Big Data Mining and Analytics, 2025, Issue 2

Year, Volume (Issue): 2025, 8(2)

Section: Philosophy
