VLSP 2013 - Translation Task

Introduction

Machine translation is one of the traditional and difficult tasks in the field of NLP. Recently, with the rapid growth of available data and computing power, the development of Vietnamese-related translation systems has attracted not only academic institutes but also R&D units of companies, both in Vietnam and overseas.

This campaign aims to create an authentic environment for evaluating translation systems both automatically and manually, in order to boost research on machine translation within the Vietnamese Language Technology community. We also welcome research groups to improve their methods and bring their systems into real-world scenarios. Fostering further domestic and international collaboration in the field is also among our goals.

Task Description

The task is to produce text translations of TED talks between English and Vietnamese fully automatically, using computers and software. Participants may submit one direction, either English to Vietnamese or Vietnamese to English, but submitting both directions is strongly encouraged. The software used should be self-developed or open-source and should illustrate the participants' research; otherwise, the system will be classified as an unconstrained system.

Unconstrained systems will not be human-evaluated or ranked. They include systems that use commercial translation products developed by people other than the participants (e.g. Systran products), systems in which free software developed by others plays the main part in the translation process (e.g. Google Translate or Bing), and systems that use data not provided by the workshop. You can, however, use other data and systems to demonstrate significant improvements obtained from larger data and report them in your system paper.

To ensure fair evaluation, participants may be asked individually to provide more information about their systems, e.g. the language model or translation model produced from the data (if they follow the statistical machine translation approach).

Evaluation Metrics

Training and Test Data

  • Parallel Corpus:

            TED talks' English-Vietnamese subtitle corpus.

            Download:

                        -  Training set: 212,454 sentence pairs from 875 talks  TBA.

                        -  Development set: 2,003 sentence pairs from 6 talks  TBA.

  • Monolingual Corpora:

English: The English data are from the WMT 2013 webpage

Vietnamese:

  • Online news crawled from Internet
  • Vietnamese part of TED parallel corpus.

Data Format

  • Input format:
    • For the TED parallel data, the training, development and test sets will be provided as UTF-8 plain text, 1-to-1 sentence-aligned, one “sentence” per line. Note that a “sentence” here is not necessarily a linguistic sentence; it may be a phrase.
    • For the monolingual corpora, we provide UTF-8 plain text, one “sentence” per line, as you will see when you download them.
  • Output format:  UTF-8, precomposed Unicode plain text, one sentence per line.  Participants may choose an appropriate casing method in preprocessing (truecasing, lowercasing, or leaving the case unchanged); in any case, the final outputs will be lowercased by the organizers and compared against lowercased references.
  • You might want to use some scripts from Moses for casing and text normalization before training:
    • Tokenizer: tokenizer.perl
    • Detokenizer: detokenizer.perl
    • Lowercaser: lowercase.perl

These tools are available in the Moses git repository:

https://github.com/moses-smt/mosesdecoder
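The format requirements above can be checked locally before training or submitting. The following is a minimal Python sketch, not an official tool of the workshop: the function names pair_sentences and normalize_output are hypothetical, and Python's built-in lowercasing is assumed to be close to, but not necessarily identical to, what the organizers apply. It verifies that the two sides of a parallel corpus are 1-to-1 aligned and converts an output line to precomposed (NFC) Unicode in lowercase.

```python
import unicodedata
from typing import Iterable, List, Tuple


def pair_sentences(src_lines: Iterable[str],
                   tgt_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair 1-to-1 aligned lines; raise if the two sides differ in length.

    Hypothetical helper: the inputs can be open file handles or lists of
    strings, one "sentence" per line.
    """
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError(
            f"misaligned corpus: {len(src)} source lines "
            f"vs {len(tgt)} target lines"
        )
    return list(zip(src, tgt))


def normalize_output(line: str) -> str:
    """Return the line as precomposed (NFC) Unicode, lowercased.

    Assumption: str.lower() approximates the organizers' lowercasing,
    which may be done with a different tool.
    """
    return unicodedata.normalize("NFC", line.strip()).lower()


# In-memory example; real data would be read from UTF-8 files instead.
pairs = pair_sentences(["Thank you .\n"], ["Cảm ơn .\n"])
submission_lines = [normalize_output(vi) for _, vi in pairs]
```

With real data, the same functions can be called on files opened with encoding="utf-8". Lowercasing locally is optional, since the final outputs are lowercased before scoring anyway, but it makes local scores easier to compare with the official ones.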

Copyrights of the data – Acknowledgment:

TED makes its collection of video recordings and subtitles of talks available under the Creative Commons BY-NC-ND license (see http://www.ted.com/pages/talk_usage_policy). We acknowledge the authorship of TED talks (BY condition) and do not redistribute subtitles for commercial purposes (NC). We use the data for research purposes only. As regards the integrity of the work (ND), we only change the format of the container, while preserving the original contents. Participants must conform to the TED Talks usage policy. We are not responsible for any violation committed by participants.

We also acknowledge the compilation of the corpora from the WMT 2013 web page as well as the Vietnamese monolingual data. Participants must respect the copyright of those data and use them for research purposes only.