VLSP 2022 – EVJVQA Challenge: Multilingual Visual Question Answering

Shared Task Registration Form

Important Dates

  • July 27, 2022: Registration opens
  • October 1, 2022: Trial Data. 
  • October 5, 2022: Public Test. 
  • October 25, 2022 (November 10, 2022): Private Test. 
  • October 27, 2022 (November 12, 2022): Final results on the private test. Competition End. 
  • November 20, 2022: Paper submission due. The top 5 teams are required to submit a paper to VLSP 2022 to have their results acknowledged. If a top-5 team does not submit a paper, the next-ranked team may submit one and take its place. 
  • November 26, 2022: Presentation at VLSP 2022 workshop. 

Task Description

Multilingual Visual Question Answering (mVQA) is a challenging task that has gradually gained traction and made substantial progress in recent years. It is a promising task at the intersection of Computational Linguistics and Computer Vision. Given an image and a question about it, an mVQA system should predict correct answers in several languages. Although the task is simple for humans, it remains a challenge for computers. 

Samples extracted from UIT-EVJVQA

UIT-EVJVQA, the first multilingual Visual Question Answering dataset covering three languages (English, Vietnamese, and Japanese), is released in this task. UIT-EVJVQA includes question-answer pairs created by humans on a set of images taken in Vietnam, with each answer created from the input question and the corresponding image. UIT-EVJVQA consists of about 30K question-answer pairs for evaluating mVQA models. To perform effectively on UIT-EVJVQA, VQA systems must not only answer monolingual questions but also identify and predict correct answers for multilingual questions. Participating teams will use the UIT-EVJVQA dataset to evaluate visual question answering models in this task.

Dataset Information

The UIT-EVJVQA dataset, with over 30K question-answer pairs on approximately 5,000 images taken in Vietnam, is provided to participating teams. The dataset is stored as .json files. Several examples of the multilingual Visual Question Answering task in VLSP 2022 are shown below.

Note: The dataset will be sent to the participating teams via email.  

Evaluation Metrics

Two evaluation metrics, F1 and BLEU, are used for this challenge. Here, BLEU is the average of the BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores. The final ranking is computed on the test set according to F1, with BLEU as a secondary metric when there is a tie. 
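
The following Python sketch illustrates one way the two metrics could be computed for a single predicted answer. It is not the organizers' official scorer: the lowercase whitespace tokenization, the token-overlap definition of F1, the unsmoothed sentence-level BLEU, and all function names are assumptions made for illustration. Vietnamese and Japanese answers would likely need a language-specific tokenizer.

```python
import math
from collections import Counter


def tokens(text):
    # Assumption: lowercase whitespace tokenization (not the official one).
    return text.lower().split()


def f1_score(prediction, reference):
    # Token-overlap F1 between a predicted answer and a reference answer.
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngrams(seq, n):
    # Multiset of all n-grams in a token sequence.
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def bleu_n(prediction, reference, max_n):
    # Unsmoothed sentence-level BLEU with uniform weights over 1..max_n.
    pred, ref = tokens(prediction), tokens(reference)
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_counts.values())
        matched = sum((pred_counts & ref_counts).values())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for predictions that are not longer than the reference.
    if len(pred) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(pred))
    return brevity * math.exp(sum(log_precisions) / max_n)


def average_bleu(prediction, reference):
    # Average of BLEU-1, BLEU-2, BLEU-3, and BLEU-4, as described above.
    return sum(bleu_n(prediction, reference, n) for n in range(1, 5)) / 4


if __name__ == "__main__":
    # Hypothetical answer pair, not taken from the dataset.
    print(f1_score("a red car", "a red motorbike"))
    print(average_bleu("a red car", "a red motorbike"))
```

Because short answers often have no matching 3-grams or 4-grams, an unsmoothed average like this one can be dominated by BLEU-1 and BLEU-2. A smoothed or corpus-level variant would give different numbers.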

Baseline System

We provide a simple baseline system (ViT + mBERT) to compare with others from the participating teams.


  • All teams must declare the pre-trained embeddings, pre-trained language models, pre-trained image models, and pre-trained vision-language models they use in this contest before October 15, 2022. Teams must not use any external resources related to visual question answering to train their VQA models, other than the data provided by the organizers. If a team uses a pre-trained model that is not on the list of models declared by the participating teams, or uses external resources related to visual question answering, its final result will not be accepted.
  • The top 3 teams may be required to provide source code to examine the final results.
  • Private test phase: Each team may submit at most three answer prediction files from its selected models. The final result of each team is the highest score among its prediction files.
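
As an illustration of how the private test rule could combine with the ranking rule (F1 first, BLEU as tie-breaker), here is a minimal Python sketch. It assumes that "highest score" means the highest F1, with BLEU breaking ties between a team's own files. The team names and scores are hypothetical, and this is not the organizers' ranking code.

```python
MAX_FILES = 3  # at most three prediction files per team


def best_score(submissions):
    # submissions: list of (f1, bleu) tuples, one per prediction file.
    if not 1 <= len(submissions) <= MAX_FILES:
        raise ValueError("each team must submit between 1 and 3 files")
    # Tuples compare by F1 first, then by BLEU.
    return max(submissions)


def rank_teams(team_submissions):
    # team_submissions: dict mapping team name -> list of (f1, bleu) tuples.
    best = {team: best_score(subs) for team, subs in team_submissions.items()}
    return sorted(best, key=lambda team: best[team], reverse=True)


if __name__ == "__main__":
    # Hypothetical teams and scores, for illustration only.
    example = {
        "team_a": [(0.40, 0.30), (0.42, 0.28)],
        "team_b": [(0.42, 0.31)],
        "team_c": [(0.35, 0.40)],
    }
    print(rank_teams(example))
```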

Registration for pre-trained embeddings and pre-trained language models

Please fill out this form by October 15, 2022. A list of pre-trained embeddings and pre-trained language models will be provided to all teams by October 17, 2022.

Submission System

All phases of the competition are hosted on the submission system: https://aihub.vn/. Note: Each team may use only one account on the submission system.


Ngan Luu-Thuy Nguyen, Kiet Van Nguyen, Tin Van Huynh, Khanh Quoc Tran, Nghia Hieu Nguyen, Duong T. D. Vo

Contact Us

Please feel free to contact us if you need any further information: evjvqa.vlsp2022@gmail.com or kietnv@uit.edu.vn

