Important dates
(Timezone: UTC+7)
- Aug 14, 2023: Registration opens
- Oct 10, 2023: Registration closes
- Oct 20, 2023: Training data and dev/public test data release
- Nov 15, 2023: System submission deadline (docker)
- Nov 26, 2023: Technical report submission
- Dec 15-16, 2023: Result announcement - Workshop days
Task Description
In 2022, the shared task on Machine Translation (MT) in the VLSP evaluation campaign marked a successful comeback of Machine Translation in VLSP activities, with a handful of teams competing on Chinese-Vietnamese translation in the news domain. However, MT coverage for users of many other languages remains limited, because MT methods need vast amounts of parallel data to train high-quality systems. This requirement is a significant obstacle for low-resource translation. Therefore, developing MT systems with relatively small parallel datasets is still highly desirable.
The organizers would like to extend the shared task this year in the hope of attracting more interest from academic institutes and industry, both within Vietnam and overseas. Our ultimate goal is to create a strong research community that boosts research on Machine Translation and its related directions. We also welcome research groups that want to improve their methods and bring their systems into real-world scenarios. Further domestic and international collaboration in the field is also among our goals.
We will feature two sub-tasks this year on Lao-Vietnamese Machine Translation (Lao to Vietnamese and Vietnamese to Lao). Participants need to deal with a scenario in which little data is available. Furthermore, since Lao and Vietnamese can be considered similar languages to some degree (e.g., many Vietnamese words have a 1-to-1 mapping in meaning with their Lao counterparts), participants could employ methods tailored to similar language pairs in this task.
Evaluation
System results will be ranked by human evaluation. Participants can only submit constrained systems, that is, systems developed by the participants and trained only on the data provided by the organizers. Only constrained systems will be evaluated and ranked. You can, however, use other data and systems (for example, a multilingual system using English, Lao and Vietnamese data, or even other data) to demonstrate the improvements gained from larger data, and report those results in your system paper. Details on how to submit your systems will be announced later.
Training and Test Data
- Parallel Corpora: Lao-Vietnamese
- Monolingual Corpora: Lao and Vietnamese
- Development set and (public) test set: Lao-Vietnamese
The development set and (public) test set will be provided together with the training sets. Participants can use those datasets during training to validate their models before applying them to the official (private) test set, which will be provided on the planned date. You can safely assume that the development set, the public test set, and the official test set are in the same domain. Participants should use the public test set with an automatic metric to decide which systems to submit. We suggest using SacreBLEU (Post, 2018) to evaluate your systems.
The official (private) test set will be posted here and announced via the VLSP mailing list.
Data Format
Input format:
- For the parallel data, the training, development and public test sets will be provided as UTF-8 plaintext, 1-to-1 sentence aligned, one “sentence” per line. Note that a “sentence” here is not necessarily a linguistic sentence; it may be a phrase.
- For the monolingual corpora, we provide UTF-8 plaintext, one “sentence” per line, exactly as you will see it when you download the files.
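Before training, it can help to confirm that a pair of parallel files really is 1-to-1 aligned. The following is a minimal sketch, assuming the two sides arrive as separate files; the function names and the file names in the usage note are hypothetical and not part of the official release.

```python
def read_lines(path):
    """Read a UTF-8 plaintext file and return one string per line, without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_parallel(src_path, tgt_path):
    """Check that two files are 1-to-1 aligned.

    Returns (number_of_pairs, empty_pairs), where empty_pairs lists the
    0-based indices of pairs in which at least one side is blank.
    Raises ValueError if the two files have different line counts.
    """
    src = read_lines(src_path)
    tgt = read_lines(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target"
        )
    empty_pairs = [
        i for i, (s, t) in enumerate(zip(src, tgt))
        if not s.strip() or not t.strip()
    ]
    return len(src), empty_pairs
```

For example, `check_parallel("train.lo", "train.vi")` (hypothetical file names) returns the number of sentence pairs and the positions of any blank lines, which you may want to filter out before training.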
Output format:
- UTF-8, precomposed Unicode plaintext, one sentence per line. Participants may choose appropriate preprocessing steps, such as word segmentation, truecasing, lowercasing, or leaving the text as it is. You might want to use the tools available in the Moses git repository.
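One way to produce precomposed output is to normalize each line to Unicode NFC before writing it. This is a minimal sketch using Python's standard `unicodedata` module; reading "precomposed Unicode" as NFC is an assumption, and the function names are hypothetical.

```python
import unicodedata


def to_precomposed(line):
    """Return the line in NFC, where base letters and combining marks
    are merged into single precomposed code points where possible."""
    return unicodedata.normalize("NFC", line)


def normalize_file(in_path, out_path):
    """Rewrite a UTF-8 file line by line in NFC, one sentence per line."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            fout.write(to_precomposed(line.rstrip("\n")) + "\n")
```

Vietnamese text often mixes precomposed and decomposed forms depending on the input method, so normalizing both the training data and the system output keeps them consistent.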
Submission
Participants must submit a working Docker image that satisfies the following constraints:
- Self-contained: the image contains your model and all of its dependencies, and must work offline. Do not use any online service/API in your code.
- Accompanied by a Bash script that uploads an input text file to your Docker image and receives the corresponding translations in an output text file. This Bash script will receive (1) the host:port web path of your Docker image, (2) the path to the input file, and (3) the path to the output file as arguments. Additionally, participants are encouraged to print output statistics to standard output, such as total run time, average sentences per second, and average words per second.
- Compressed in a known format: .tar.gz | .tar.bz2 | .7z | .rar | .zip
- Provided with an MD5 checksum for integrity verification.
- The compressed file is to be hosted on a well-known cloud storage service (e.g., Google Drive, Microsoft OneDrive) with appropriate download permissions.
- Multiple submissions are allowed, but only the last submission will be evaluated.
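The MD5 checksum requested in the list above can be computed with any standard tool. As an illustration, here is a minimal sketch using Python's standard `hashlib` module; the function name and the archive name in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file.

    The file is read in chunks so that a large compressed Docker image
    does not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("submission.tar.gz")` (hypothetical file name) returns a 32-character hexadecimal string that can be compared against the checksum of the downloaded copy.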
Organizers
Van-Vinh Nguyen (vinhnv@vnu.edu.vn)
Le-Minh Nguyen (nguyenml@jaist.ac.jp)
Hong-Viet Tran (vietcdcn1@gmail.com)