VLSP 2023 challenge on Visual Reading Comprehension for Vietnamese

Shared Task Registration Form

Important Dates
(Timezone: UTC+7)

  • August 14, 2023: Registration open
  • September 04, 2023: Trial data release
  • September 11, 2023: Public test
  • October 09, 2023: Private test 
  • October 12, 2023: Final results on the private test. Competition ends.
  • November 26, 2023: Technical report submission
  • December 15-16, 2023: Presentation at VLSP 2023 workshop 

(The top 5 teams are required to submit a paper to VLSP 2023 for their achievement to be acknowledged. If any of the top teams do not submit a paper, the next-ranked teams may submit and take their places.)

Task Description

Visual Question Answering (VQA) [1] is a challenging task that has gradually gained traction and made substantial progress in recent years. Recently, researchers have turned their attention to the reading comprehension ability of VQA models, that is, the ability of VQA models to use scene text in images as additional information for answering the given question. This task is challenging and interesting because it requires a close-to-human ability to understand images through both scene and textual information.

In this task, we provide the participants with the OpenViVQA dataset [2]. This dataset contains images collected in Vietnam and question-answer pairs constructed through crowd-sourcing. Questions in the OpenViVQA dataset query not only scene information but also textual information available in the images. We therefore believe this task is interesting and worth the participation of researchers and engineers in Natural Language Processing and Computer Vision.

Dataset Information

The OpenViVQA dataset, with more than 11,000 images collected in Vietnam and more than 37,000 associated question-answer pairs, is provided to the participating teams. The dataset is stored as .json files. Several examples of the ViVRC task at VLSP 2023 are shown below.

Note: The dataset will be sent to the participating teams via email.  
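
Since the annotation schema is not described in this document, the minimal Python sketch below only opens one of the provided .json files and prints its top-level structure so that teams can adapt their own loaders; the file name used here is hypothetical.

    import json

    def inspect(path: str) -> None:
        """Print the top-level structure of an annotation .json file."""
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            for key, value in data.items():
                size = len(value) if hasattr(value, "__len__") else value
                print(f"{key}: {type(value).__name__} ({size})")
        else:
            # Some annotation files may simply be a list of records.
            print(f"{type(data).__name__} with {len(data)} entries")

    if __name__ == "__main__":
        inspect("openvivqa_train.json")  # hypothetical file name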

Evaluation Metrics

Two evaluation metrics, BLEU [3] and CIDEr [4], are used for this challenge. In particular, the BLEU score is the average of BLEU-1, BLEU-2, BLEU-3, and BLEU-4. The final ranking is determined on the test set according to CIDEr, with BLEU used as the tie-breaking metric.
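
To make the ranking rule concrete, the following minimal Python sketch combines the scores and ranks teams, assuming each team's BLEU-1 to BLEU-4 and CIDEr values have already been computed with the standard scorers cited in [3] and [4]; the team names and numbers are made up for illustration.

    from statistics import mean

    # Made-up per-team scores; in practice these come from the BLEU [3]
    # and CIDEr [4] scorers run on the test-set predictions.
    teams = {
        "team_a": {"bleu": [0.52, 0.41, 0.33, 0.27], "cider": 1.10},
        "team_b": {"bleu": [0.55, 0.43, 0.34, 0.27], "cider": 1.10},
        "team_c": {"bleu": [0.48, 0.36, 0.29, 0.23], "cider": 0.95},
    }

    def final_scores(metrics: dict) -> tuple:
        # The challenge's BLEU score is the average of BLEU-1 to BLEU-4.
        avg_bleu = mean(metrics["bleu"])
        # Rank primarily by CIDEr; the averaged BLEU breaks ties.
        return (metrics["cider"], avg_bleu)

    ranking = sorted(teams, key=lambda name: final_scores(teams[name]), reverse=True)
    print(ranking)  # ['team_b', 'team_a', 'team_c']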

Baseline System
We provide a baseline method to compare with the systems from the participating teams.

Terms

  • All teams must declare the pre-trained embeddings, pre-trained language models, pre-trained image models, and pre-trained vision-language models they use in this contest by September 11, 2023, and must not use any external resources related to visual question answering for training VQA models, except for the data provided by the organizers. If a team uses any pre-trained model that is not on the list it declared or uses external resources related to visual question answering, its final result will not be accepted.
  • The top-3 teams may be required to provide their source code so that the final results can be verified.
  • Private test phase: Each team may submit at most three answer prediction files from its selected models. The final result of each team is based on the highest score among the submitted prediction files.

Registration for pre-trained embeddings and pre-trained language models

A list of pre-trained embeddings and pre-trained language models will be provided to all teams by September 04, 2023.

Submission System

All phases of the competition are run on the submission system: https://codalab.lisn.upsaclay.fr/competitions/15212. Note: Each team may use only one account on the submission system.

Organizers

Ngan Luu-Thuy Nguyen, Kiet Van Nguyen, Nghia Hieu Nguyen, Khiem Vinh Tran, Huy Quoc To, University of Information Technology, VNU-HCM

Contact Us
Please feel free to contact us if you need any further information: vlsp2023.vivrc@gmail.com

References
[1] S. Antol et al., ‘VQA: Visual Question Answering’, in International Conference on Computer Vision (ICCV), 2015.
[2] N. H. Nguyen, D. T. D. Vo, K. Van Nguyen, and N. L.-T. Nguyen, ‘OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese’, Information Fusion, Vol. 100, December 2023.
[3] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘Bleu: a Method for Automatic Evaluation of Machine Translation’, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[4] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, ‘CIDEr: Consensus-based Image Description Evaluation’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[5] R. Hu, A. Singh, T. Darrell, and M. Rohrbach, ‘Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9992–10002.