VLSP 2021 - Machine Translation

Important dates

  • Aug 05, 2021: Registration opens
  • Oct 09, 2021: Training data and dev/test data released
  • Nov 02, 2021: Official test set released
  • Nov 04, 2021: System submission deadline
  • Nov 15, 2021: Technical report submission deadline
  • Nov 26, 2021: Results announcement (workshop day)

Introduction

Last year's Shared Task on Machine Translation in the VLSP Evaluation Campaign marked a successful comeback of Machine Translation in VLSP activities, with a handful of teams competing on English-Vietnamese translation of COVID-19 news. The organizers would like to extend the task this year in the hope of attracting more interest from academic institutes and industry, both within Vietnam and overseas. Our ultimate goal is to create a strong research community that boosts research on Machine Translation and related directions. We also welcome research groups that improve existing methods and bring their systems into real-world use. Fostering further domestic and international collaboration in the field is also among our goals.

Task Description

We will feature two tasks this year:

  1. English to Vietnamese Machine Translation: As last year, we ask teams/individuals to translate texts from English to Vietnamese in the news domain. Unlike last year, however, we will provide much larger (and possibly noisy) datasets. Participants need to design and apply appropriate data filtering and preprocessing techniques to get the best out of the provided data (see the filtering sketch after this list). 
  2. Chinese to Vietnamese Machine Translation: This year we offer this translation direction as a low-resource translation task. Participants need to deal with a scenario in which little data is available. Furthermore, since Chinese and Vietnamese can be considered similar languages to some degree (e.g. many Sino-Vietnamese words have a one-to-one mapping in meaning with their Chinese counterparts), participants could employ methods designed for similar language pairs in this task.
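
To make the data-filtering requirement in task 1 concrete, the following is a minimal sketch of one common heuristic: dropping sentence pairs that are empty, overly long, or badly length-mismatched. The file names and thresholds are illustrative assumptions, not part of the task specification.

    # Minimal parallel-corpus filtering sketch (illustrative only).
    # Drops pairs that are empty, too long, or badly length-mismatched.

    def length_ratio_filter(src_path, tgt_path, max_len=200, max_ratio=2.0):
        """Yield (src, tgt) pairs passing simple length-based checks."""
        with open(src_path, encoding="utf-8") as fs, \
             open(tgt_path, encoding="utf-8") as ft:
            for src, tgt in zip(fs, ft):
                src, tgt = src.strip(), tgt.strip()
                if not src or not tgt:
                    continue  # skip empty segments
                ns, nt = len(src.split()), len(tgt.split())
                if ns > max_len or nt > max_len:
                    continue  # skip overly long segments
                if max(ns, nt) / max(1, min(ns, nt)) > max_ratio:
                    continue  # skip badly length-mismatched pairs
                yield src, tgt

    if __name__ == "__main__":
        # "train.en" and "train.vi" are hypothetical file names.
        for src, tgt in length_ratio_filter("train.en", "train.vi"):
            print(src, tgt, sep="\t")

A real pipeline would likely combine several such heuristics (deduplication, language identification, punctuation checks); this only shows the general shape.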

Evaluation

Results will be ranked by human evaluation. Participants can submit constrained and unconstrained systems. Constrained systems are systems developed by the participants and trained only on the data provided by the organizers. Unconstrained systems include systems that use commercial translation products developed by parties other than the participants (e.g. Systran products), systems in which software developed by others performs the main part of the translation (e.g. Google Translate), and systems that use data not provided by the workshop. Only constrained systems will be human-evaluated and ranked. You can, however, use other data and systems (for example, a multilingual system using English, Chinese, and Vietnamese data, or even other data) to demonstrate the significant improvements achievable with larger data, and report them in your system paper.

Details on how to submit your systems will be announced later.

Training and Test Data

  • Parallel Corpora:
    • English-Vietnamese: TBA.
    • Chinese-Vietnamese: TBA.
  • Monolingual Corpora:
    • English-Vietnamese: TBA.
    • Chinese-Vietnamese: TBA.
  • Development set and (public) test set:
    • English-Vietnamese: TBA.
    • Chinese-Vietnamese: TBA.

The development set and (public) test set will be provided along with the training sets. Participants can use these datasets during training to validate their models before applying them to the official (private) test set, which will be released on the planned date. You can safely assume that the development set, the public test set, and the official test set are in the same domain. Participants should use the public test set with an automatic metric to decide which systems to submit. We suggest using the BLEU score (Papineni et al., 2002) as implemented in SacreBLEU; see the example below.
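
As an illustration, this sketch computes corpus-level BLEU with SacreBLEU's Python API. The file names hyp.vi and ref.vi are assumptions, holding one segment per line as described under Data Format below.

    import sacrebleu  # pip install sacrebleu

    # hyp.vi holds one system translation per line; ref.vi holds the
    # matching reference line by line (hypothetical file names).
    with open("hyp.vi", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open("ref.vi", encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # corpus_bleu expects a list of hypotheses and a list of
    # reference streams (here, a single reference stream).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")

The command-line equivalent is: sacrebleu ref.vi -i hyp.vi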

  • Official (private) test set:
    • Will be posted here and announced via the VLSP mailing list.

Data Format

  • Input format:
    • For the parallel data, the training, development, and test sets will be provided as UTF-8 plaintext, 1-to-1 sentence aligned, one “sentence” per line. Note that a “sentence” here is not necessarily a linguistic sentence but may be a phrase.
    • For the monolingual corpora, we provide UTF-8 plaintext, one “sentence” per line, in the form in which the texts were downloaded.
  • Output format:
    • UTF-8, precomposed Unicode plaintext, one sentence per line. Participants may choose appropriate preprocessing steps such as word segmentation, truecasing, lowercasing, or leaving the text as-is; tools for these steps are available in the Moses git repository. A minimal normalization sketch follows this list.
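
Since the output must be precomposed Unicode, an NFC-normalization pass using only the Python standard library might look like the following (the script name in the usage comment is hypothetical):

    import sys
    import unicodedata

    # Vietnamese diacritics can be encoded precomposed (NFC) or
    # decomposed (NFD); the required output format is precomposed.
    # Reads plaintext from stdin, writes NFC-normalized lines to stdout.
    # Usage: python normalize_nfc.py < raw.vi > hyp.vi
    for line in sys.stdin:
        sys.stdout.write(unicodedata.normalize("NFC", line))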

Submission

Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run. All other submissions are treated as CONTRASTIVE runs. If none of the runs is marked as PRIMARY, the latest submission (according to the file timestamp) for the respective track will be used as the PRIMARY run. Only PRIMARY systems are evaluated.

Task Organizers

  • Van-Vinh Nguyen
  • Le-Minh Nguyen
  • Hong-Viet Tran

Copyright of the data – Acknowledgment

TBA.