- July 27, 2022: Registration opens
- October 1, 2022: Trial data released.
- October 5, 2022: Public test begins.
- October 25, 2022 (November 10, 2022): Private test.
- October 27, 2022 (November 12, 2022): Final results on the private test. Competition ends.
- November 20, 2022: Paper submission due. The top 5 teams are required to submit a paper to VLSP 2022 to have their achievement acknowledged. If any of the top teams do not submit a paper, the next-ranked teams may submit and take their places.
- November 26, 2022: Presentation at VLSP 2022 workshop.
Multilingual Visual Question Answering (mVQA) is a challenging task that has gradually gained traction and made substantial progress in recent years. mVQA is also a promising task at the intersection of Computational Linguistics and Computer Vision. Given an image and a question about it, an mVQA system must predict the correct answer, which may be expressed in one of several languages. Although the task is simple for humans, it remains a challenge for computers.
UIT-EVJVQA, the first multilingual Visual Question Answering dataset covering three languages (English, Vietnamese, and Japanese), is released for this task. UIT-EVJVQA contains human-created question-answer pairs on a set of images taken in Vietnam, where each answer is written from the input question and the corresponding image. The dataset consists of about 30K question-answer pairs for evaluating mVQA models. To perform effectively on UIT-EVJVQA, a VQA system must not only answer monolingual questions but also identify and correctly answer multilingual questions. Participating teams will use the UIT-EVJVQA dataset to evaluate their visual question answering models in this task.
The UIT-EVJVQA dataset, with over 30K question-answer pairs on approximately 5,000 images taken in Vietnam, is provided to participating teams. The dataset is stored as .json files. Several examples of the multilingual Visual Question Answering task in VLSP 2022 are shown below.
Note: The dataset will be sent to the participating teams via email.
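Since the dataset ships as .json files, loading it is a matter of parsing the JSON and joining each question-answer pair with its image. The record layout sketched below (field names such as `images`, `annotations`, and `image_id`) is hypothetical; consult the files sent by the organizers for the actual schema.

```python
import json

# Hypothetical record layout -- the real schema is defined by the organizers.
sample = json.loads("""
{
  "images": [{"id": 1, "filename": "00001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1,
     "question": "What color is the hat?",
     "answer": "red"}
  ]
}
""")

# Index images by id so each QA pair can be joined with its image file.
images = {img["id"]: img["filename"] for img in sample["images"]}
for qa in sample["annotations"]:
    print(images[qa["image_id"]], "|", qa["question"], "->", qa["answer"])
```

When reading the actual files, open them with `encoding="utf-8"` so Vietnamese and Japanese text is decoded correctly.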
Two evaluation metrics, F1 and BLEU, are used for this challenge. In particular, BLEU is reported as the average of BLEU-1, BLEU-2, BLEU-3, and BLEU-4. The final ranking is determined by F1 on the test set, with BLEU as a secondary metric in case of a tie.
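The exact scoring script is the organizers', but one common reading of these metrics is token-overlap F1 (as used in extractive QA evaluation) and per-order clipped n-gram precision with a brevity penalty for BLEU-n. The sketch below follows that assumed reading:

```python
import math
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between predicted and gold answers (assumed reading)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def bleu_n(prediction, reference, n):
    """Clipped n-gram precision with a brevity penalty (assumed BLEU-n)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if len(pred) < n:
        return 0.0
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum((pred_ngrams & ref_ngrams).values())
    precision = clipped / sum(pred_ngrams.values())
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision

def avg_bleu(prediction, reference):
    """Average of BLEU-1 through BLEU-4, as described for this challenge."""
    return sum(bleu_n(prediction, reference, n) for n in range(1, 5)) / 4
```

For example, a prediction identical to the reference scores 1.0 on both metrics, while a partial overlap such as "red hat" against "a red hat" earns partial F1 credit.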
We provide a simple baseline system (ViT + mBERT) to compare with others from the participating teams.
- All teams must declare the pre-trained embeddings, pre-trained language models, pre-trained image models, and pre-trained vision-language models they use in this contest before October 15, 2022, and must not use any external resources related to visual question answering for training VQA models other than the data provided by the organizers. If a team uses any pre-trained model not on the list declared by the participating teams, or uses external resources related to visual question answering, its final result will not be accepted.
- The top 3 teams may be required to provide source code to examine the final results.
- Private test phase: Each team may submit at most three answer prediction files from its selected models. The final result of each team is the highest score among the submitted files.
Registration for pre-trained embedding and pre-trained language models
Please fill out this form by October 15, 2022. A list of pre-trained embedding and pre-trained language models will be provided to all teams by October 17, 2022.
All phases of the competition are run on the system: https://aihub.vn/. Note: Each team may use only one account on the submission system.
Ngan Luu-Thuy Nguyen, Kiet Van Nguyen, Tin Van Huynh, Khanh Quoc Tran, Nghia Hieu Nguyen, Duong T. D. Vo
Please feel free to contact us if you need any further information: email@example.com or firstname.lastname@example.org