EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

CVPR 2025
Hefei University of Technology, National University of Singapore, University of Science and Technology of China

Abstract

We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.

EgoTextVQA

EgoTextVQA is a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. It consists of two parts: 1) EgoTextVQA-Outdoor focuses on outdoor scenarios, with 694 videos and 4,848 QA pairs reflecting questions that may arise while driving; 2) EgoTextVQA-Indoor emphasizes indoor scenarios, with 813 videos and 2,216 QA pairs that users may encounter in house-keeping activities. There are several unique features of EgoTextVQA:

  • It stands out as the first VideoQA testbed towards egocentric scene-text aware QA assistance in the wild, with 7K QAs that reflect diverse user intentions under 1.5K different egocentric visual situations.
  • The QAs emphasize scene-text comprehension, yet only about half of them invoke the exact scene text.
  • The situations cover both indoor and outdoor activities.
  • Detailed timestamps and categories of the questions are provided to facilitate real-time QA and model analysis.
  • In the real-time QA setting, the answer must be derived from the video content captured before the question is asked, rather than from the entire video; the answer therefore changes with the timestamp at which the question is posed.
  • High- and low-resolution video settings can be used to evaluate the scene-text reading capabilities of MLLMs. In EgoTextVQA, we evaluate models under both high-resolution (1920×1080, 1280×720) and low-resolution (960×540, 640×360) video settings; a preprocessing sketch covering this and the real-time setting follows this list.
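
As a concrete illustration of the real-time QA and resolution settings above, the snippet below shows one way to prepare model inputs: keep only the frames captured before the question timestamp and optionally downscale them to a target resolution. This is a minimal sketch, assuming frames are decoded with OpenCV and that each annotation carries a question timestamp in seconds; it is not the official evaluation pipeline.

```python
import cv2  # OpenCV for video decoding and resizing


def frames_before_question(video_path: str, question_ts: float,
                           target_size: tuple[int, int] | None = (960, 540),
                           sample_fps: float = 1.0) -> list:
    """Return frames captured strictly before `question_ts` (in seconds).

    Mirrors the real-time QA setting: the model may only see video recorded
    up to the moment the question is asked. `target_size` emulates the
    low-resolution (e.g. 960x540) or high-resolution (e.g. 1920x1080)
    evaluation settings. Field names and defaults are illustrative
    assumptions, not the released data format.
    """
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / sample_fps)), 1)

    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = index / native_fps
        if t >= question_ts:          # stop once the question time is reached
            break
        if index % step == 0:         # sparse sampling at `sample_fps`
            if target_size is not None:
                frame = cv2.resize(frame, target_size)
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```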

Dataset comparison


QA examples of different question categories


Dataset analysis


Dataset Examples

Examples on EgoTextVQA-Outdoor


Examples on EgoTextVQA-Indoor


Experiment Results

Evaluation results of MLLMs on EgoTextVQA-Outdoor with low resolution (960×540, 640×360)


Evaluation results of MLLMs on EgoTextVQA-Indoor with resolution (640×360, 480×360)


Evaluation results of MLLMs on the real-time QA subset of EgoTextVQA-Outdoor (∼623 QA pairs)


Evaluation results of MLLMs on EgoTextVQA-Outdoor with high resolution (1920×1080, 1280×720)

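
The tables above report accuracy over the QA pairs. As a rough guide to how such numbers can be aggregated, the sketch below scores each prediction against its reference answer and averages per question category. The record fields (`prediction`, `answer`, `category`) and the normalized exact-match criterion are assumptions for illustration only; the paper's own scoring protocol may differ (e.g. LLM-assisted judging).

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and drop punctuation for a lenient string comparison."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()


def accuracy_by_category(records: list[dict]) -> dict[str, float]:
    """Aggregate per-category accuracy.

    Each record is assumed to carry 'prediction', 'answer', and 'category'
    fields; the exact-match rule here is a stand-in for whatever scoring
    criterion an evaluation actually uses.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        cat = r["category"]
        total[cat] += 1
        if normalize(r["prediction"]) == normalize(r["answer"]):
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```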

BibTeX

@article{zhou2025egotextvqa,
      title={EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering}, 
      author={Sheng Zhou and Junbin Xiao and Qingyun Li and Yicong Li and Xun Yang and Dan Guo and Meng Wang and Tat-Seng Chua and Angela Yao},
      journal={arXiv preprint arXiv:2502.07411},
      year={2025}
}

Acknowledgments

We would like to thank the following repos for their great work: