Automatic Speech Recognition for Vietnamese

Participation Agreement

Important dates

  • Sep 10, 2020: Registration open

  • Sep 30, 2020: Registration closed

  • Oct 01, 2020: Training data for the ASR-T1 released

  • Nov 30, 2020: Test sets released for both ASR-T1 and ASR-T2

  • Dec 07, 2020: Test result submission

  • Dec 15, 2020: Technical report submission

  • Dec 18, 2020: Result announcement (workshop day)

VLSP2020 ASR will feature two evaluation tasks. Teams can participate in either task or in both, and are free to choose end-to-end or traditional modeling approaches.

Task-01 (ASR-T1): Focuses on developing a full ASR pipeline from scratch. For this task, the organizers will provide a training dataset of 250 hours of transcribed audio. Participants are required to use only this provided data to develop all models, including acoustic and language models; using any other resource for model development is not allowed.

Task-02 (ASR-T2): Focuses on noisy and spontaneous speech. For this task, the organizers will not provide training data; participants may use any available data sources to develop their models, without restriction.

Training Data

For the ASR-T1 task:

  • Number of hours: ~250

  • Transcription: manually labeled

Evaluation Data

Two evaluation sets will be released: vlsp2020-asr-t1 for task ASR-T1 and vlsp2020-asr-t2 for task ASR-T2. The vlsp2020-asr-t1 set is an in-domain test set, while vlsp2020-asr-t2 consists of noisy and spontaneous speech.

Evaluation Metric

The quality of the models will be evaluated by the Word Error Rate (WER) metric.
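
As an illustration only, the sketch below computes WER via Levenshtein alignment between a reference and a hypothesis (WER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the number of reference words). It assumes plain space-separated word strings; official scoring will be done with the organizers' tooling.

    # Minimal WER sketch: edit distance between reference and hypothesis words.
    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.split()
        hyp = hypothesis.split()
        # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution/match
        return dp[len(ref)][len(hyp)] / len(ref)

    # Example: one inserted word over a 4-word reference gives WER = 0.25.
    print(wer("hôm nay trời đẹp", "hôm nay trời đẹp quá"))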

Submission Guidelines

ASR Run Submission Format

Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run. All other run submissions are treated as CONTRASTIVE runs. If no run is marked as PRIMARY, the latest submission (according to the file timestamp) for the respective track will be used as the PRIMARY run.

Runs have to be submitted as a gzipped TAR archive (see the format below) and uploaded to an assigned folder. Participants will receive the folder URL after registration.
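
As a rough illustration (the file and archive names are hypothetical), a run archive could be packaged in Python as follows; any equivalent tool that produces a gzipped TAR archive works as well.

    # Package run output files (CTM format, described below) into a .tar.gz archive.
    import tarfile

    run_files = ["primary.ctm", "contrastive1.ctm"]  # hypothetical run files
    with tarfile.open("teamname_asr_t1.tar.gz", "w:gz") as archive:
        for path in run_files:
            archive.add(path)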

Submissions have to be made in CTM format; see the CTM documentation in the NIST SCTK toolkit for details. Confidence values are optional, and the channel number has to be '1'. Scoring will be case-insensitive. Submissions have to be encoded in UTF-8.
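
For illustration, the sketch below writes a hypothesis to a CTM-style file with one word per line, following the field order <audio-file-id> <channel> <start-time> <duration> <word> [<confidence>]. The utterance id, timings, and words are hypothetical; the NIST SCTK documentation remains the authoritative reference for the format.

    # Write one CTM line per recognized word; channel is always '1', file is UTF-8.
    words = [
        # (start time in seconds, duration in seconds, word, confidence)
        (0.00, 0.32, "hôm", 0.98),
        (0.32, 0.28, "nay", 0.95),
        (0.60, 0.41, "trời", 0.97),
        (1.01, 0.35, "đẹp", 0.93),
    ]
    with open("primary.ctm", "w", encoding="utf-8") as f:
        for start, dur, word, conf in words:
            f.write(f"utt_0001 1 {start:.2f} {dur:.2f} {word} {conf:.2f}\n")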

Output Conventions

Since input speech can sometimes be transcribed in more than one way, the following rules apply to reduce such ambiguity (a small self-check sketch is given after the list):

  1. The text will be scored case-insensitively, but may be submitted with case information

  2. Numbers, dates, etc. must be transcribed in words as they are spoken, not in digits

  3. Common acronyms such as NATO and EU are written as one word, without any special markers between the letters, regardless of whether they are spoken as a single word or spelled out as a letter sequence. All other spelled-out letter sequences are written as individual letters separated by spaces.

  4. English words and names of people and places from other languages, such as YouTube and Facebook, are written in their original form, not according to their Vietnamese pronunciation.
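
As an optional self-check (not part of the official tooling), the small sketch below flags tokens that obviously violate the conventions above, such as digits that should be written out as words or stray punctuation inside tokens.

    # Flag hypothesis tokens that contain digits or non-letter characters.
    import re

    def check_conventions(text: str) -> list:
        issues = []
        for token in text.split():
            if re.search(r"\d", token):
                issues.append(f"digit in token {token!r}: write numbers as words")
            elif re.search(r"[^\w]", token):
                issues.append(f"special character in token {token!r}")
        return issues

    # Example: flags "15" and "3" as digits that should be spelled out.
    print(check_conventions("hội nghị diễn ra ngày 15 tháng 3"))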