VLSP 2022 - Machine Translation | Association for Vietnamese Language and Speech Processing

Shared Task Registration Form

Important dates

July 27, 2022: Registration open
Aug 31, 2022: Registration closes
Oct 09, 2022: Training data and dev/test data released
Nov 04, 2022: System submission deadline (docker)
Nov 10, 2022: Technical report submission
Nov 26, 2022: Result announcement (workshop day)

Introduction

Two years ago, the Shared Task on Machine Translation in the VLSP evaluation campaign marked a successful comeback of machine translation in VLSP activities, with a handful of teams competing on English-Vietnamese translation of COVID-19 news. The organizers would like to extend it this year in the hope of attracting more interest from academic institutes and industry, both within Vietnam and overseas. Our ultimate goal is to build a strong research community that advances research on machine translation and related directions. We also welcome research groups that want to improve their methods and bring their systems into real-world scenarios. Fostering further domestic and international collaboration in the field is also among our goals.

Task Description

We will feature two tasks this year:

Chinese-Vietnamese Machine Translation (Chinese to Vietnamese and Vietnamese to Chinese): Participants need to deal with a scenario in which little data is available. Furthermore, since Chinese and Vietnamese can be considered similar languages to some degree (e.g., many Sino-Vietnamese words map 1-to-1 in meaning to their Chinese counterparts), participants could employ methods designed for similar language pairs in this task.

Evaluation

Results will be ranked by human evaluation. Participants can submit constrained and unconstrained systems. Constrained systems are systems developed by the participants and trained only on the data provided by the organizers. Unconstrained systems include systems that use commercial translation products developed by parties other than the participants (e.g., Systran products), systems in which software developed by others plays the main part in the translation process (e.g., Google Translate), and systems that use data not provided by the workshop. Only constrained systems will be human-evaluated and ranked; unconstrained systems will not. You can, however, use other data and systems (for example, a multilingual system using English, Chinese and Vietnamese data, or even other data) to demonstrate the improvements gained from larger data, and report them in your system paper.
Details on how to submit your systems will be announced later.

Training and Test Data

Parallel Corpora:
- Chinese-Vietnamese: TBA.
Monolingual Corpora:
- Chinese-Vietnamese: TBA.
Development set and (public) test set:
- Chinese-Vietnamese: TBA.

The development set and (public) test set will be provided together with the training sets. Participants can use those datasets during training to validate their models before applying them to the official (private) test set, which will be provided on the planned date. You can safely assume that the development set, the public test set, and the official test set are in the same domain. Participants should use the public test set with an automatic metric to decide which systems to submit. We suggest the BLEU score (Papineni et al., 2002) as implemented in SacreBLEU.

Official (private) test set:
• Will be posted here and announced via the VLSP mailing list.

Data Format and Training Data

Input format:

For the parallel data, the training, development and test sets will be provided as UTF-8 plain text, 1-to-1 sentence aligned, one “sentence” per line. Note that a “sentence” here is not necessarily a linguistic sentence; it may be a phrase.
For the monolingual corpora, we provide UTF-8 plain text, one “sentence” per line, exactly as you will see it when you download the files.
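The alignment requirement above can be checked before training. The following is a minimal sketch, assuming the two sides of a parallel corpus arrive as separate files; the function names and the example file names in the usage note are hypothetical and not part of the task specification.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without line terminators."""
    # newline="\n" makes "\n" the only line separator, so unusual Unicode
    # separators inside a sentence cannot shift the alignment.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Return (source, target) pairs from two line-aligned files."""
    src = read_lines(src_path)
    tgt = read_lines(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target"
        )
    return list(zip(src, tgt))
```

For example, `read_parallel("train.zh", "train.vi")` would return a list of sentence pairs, or raise `ValueError` if the two files differ in length.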

Output format:

UTF-8, precomposed Unicode plain text, one sentence per line. Participants may choose appropriate preprocessing steps, such as word segmentation, truecasing, lowercasing, or leaving the text as is. You might want to use the tools available in the Moses git repository.
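One way to produce output in this format is sketched below. It assumes that "precomposed" refers to Unicode Normalization Form C (NFC), which is an interpretation rather than something the organizers state; the function name is hypothetical.

```python
import unicodedata


def write_translations(sentences, out_path):
    """Write one NFC-normalized translation per line, encoded as UTF-8."""
    with open(out_path, "w", encoding="utf-8", newline="\n") as handle:
        for sentence in sentences:
            # Replace embedded line breaks so each translation stays on one line.
            single_line = sentence.replace("\r", " ").replace("\n", " ")
            # Assumption: "precomposed" means NFC, e.g. "e" followed by
            # combining marks becomes a single precomposed code point.
            handle.write(unicodedata.normalize("NFC", single_line) + "\n")
```

Keeping line breaks out of each translation preserves the 1-to-1 correspondence between input and output lines.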

Submission
Participants must submit a working Docker image that satisfies the following constraints:

Self-contained: the image contains your model and all of its dependencies, and must work offline. Do not use any online service or API in your code.
Accompanied by a Bash script that uploads an input text file to your Docker image and receives the corresponding translations in an output text file. This Bash script will receive as arguments (1) the host:port web address of your Docker image, (2) the path to the input file, and (3) the path to the output file. Additionally, participants are encouraged to print additional statistics to standard output, such as total run time, average sentences per second, and average words per second.
Compressed in a known format: .tar.gz | .tar.bz2 | .7z | .rar | .zip
Accompanied by an MD5 checksum for integrity verification.
Hosted on a known cloud storage service (e.g., Google Drive, Microsoft OneDrive) with appropriate download permission granted.
Multiple submissions are allowed, but only the last submission will be evaluated.
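
The MD5 checksum mentioned above can be computed with standard tools. The following is a minimal sketch using Python's `hashlib`; the function name and the archive name in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in 1 MiB chunks keeps memory use flat for large archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("submission.tar.gz")` returns a 32-character hex string that can be sent along with the download link.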

Organizers

Van-Vinh Nguyen (vinhnv@vnu.edu.vn)
Le-Minh Nguyen (nguyenml@jaist.ac.jp)
Hong-Viet Tran (vietcdcn1@gmail.com)
