Automatic Speech Recognition

Important dates

  • July 15, 2019: Registration open

  • August 1, 2019: Training data released

  • Oct 5, 2019: Test set released

  • Oct 6, 2019: Test result submission

  • Oct 4, 2019: Technical report submission

  • Oct 13, 2019: Result announcement (workshop day)

Automatic Speech Recognition (ASR) systems are used in many everyday applications, including AI assistants, command and control, robots, and call centers. An ASR system transcribes input speech into a sequence of words, using techniques that range from HMM-GMM models to more recent sequence-to-sequence models. A robust ASR system is one that accurately recognizes words under varied input signal conditions.

The performance of an ASR system is measured by the number of errors it makes, which fall into three types:

  1. Deletion: the system cannot recognize words that are spoken in the input audio.

  2. Substitution: the system misrecognizes a spoken word as a different word.

  3. Insertion: the system recognizes words that are not spoken in the input audio.
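The three error types above are commonly combined into a word error rate (WER): the number of substitutions, deletions, and insertions divided by the number of words in the reference transcript. The following is a minimal sketch of that calculation using a standard edit-distance alignment. The function names are illustrative, and the official scoring tool may differ in details such as tie-breaking between alignments of equal cost.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment.

    Both arguments are plain strings; words are split on whitespace and
    compared case-insensitively.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = (total_cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, subs, dels, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, subs, dels, ins)          # correct word
            else:
                diagonal = (cost + 1, subs + 1, dels, ins)  # substitution
            cost, subs, dels, ins = dp[i - 1][j]
            deletion = (cost + 1, subs, dels + 1, ins)
            cost, subs, dels, ins = dp[i][j - 1]
            insertion = (cost + 1, subs, dels, ins + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])

    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / number of reference words."""
    subs, dels, ins = count_errors(reference, hypothesis)
    return (subs + dels + ins) / len(reference.split())
```

When several alignments share the same minimum cost, the split between substitutions, deletions, and insertions can vary, but the total error count and therefore the WER stay the same.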

This year, the organizers provide training data (see below), which participants can use to train their models. Participants are also free to use other data sources.

Training Data

  • Number of utterances: 315449

  • Number of hours: ~415 hours

  • Transcription: manually labeled

Evaluation Data

Two evaluation sets will be made available: the test2018 set and the test2019 set, which serves as the progressive test set. Submitting results on both sets is mandatory.

Submission Guidelines

ASR Run Submission Format

Multiple run submissions are allowed, but participants must explicitly mark one run as PRIMARY. All other submitted runs are treated as CONTRASTIVE runs. If none of the runs is marked as PRIMARY, the latest submission (according to the file timestamp) for the respective track will be used as the PRIMARY run.

Runs must be submitted as a gzipped TAR archive (see the format below) and uploaded to an assigned folder. Participants will receive the folder URL after registration.
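As a rough illustration, a gzipped TAR archive can be produced with Python's standard `tarfile` module. The function name and the flat layout (each file stored under its base name) are assumptions made for this sketch; the required archive layout and file naming should be confirmed with the organizers.

```python
import tarfile
from pathlib import Path


def make_submission_archive(ctm_files, archive_path):
    """Pack the given CTM files into a gzipped TAR archive.

    Each file is stored under its base name only (an assumed flat layout).
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in ctm_files:
            path = Path(path)
            tar.add(str(path), arcname=path.name)
    return archive_path
```

For example, `make_submission_archive(["primary.ctm"], "run.tar.gz")` would write `run.tar.gz` containing a single file; both file names here are hypothetical.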

Submissions must be in CTM format; see the CTM section of the NIST SCTK documentation for details. Confidence values are optional. The channel number must be '1'. Scoring is case-insensitive. Submissions must be encoded in UTF-8.
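A CTM file holds one recognized word per line, with whitespace-separated fields: recording ID, channel, start time in seconds, duration in seconds, the word, and an optional confidence score. The sketch below writes such lines with the channel fixed to '1' and UTF-8 encoding, as required above. The function names, the two-decimal time precision, and the tuple layout of `words` are assumptions for illustration; the NIST SCTK documentation remains the authoritative description.

```python
def format_ctm_line(recording_id, start, duration, word, confidence=None):
    """Build one CTM line; the channel field is always '1'."""
    fields = [recording_id, "1", f"{start:.2f}", f"{duration:.2f}", word]
    if confidence is not None:
        fields.append(f"{confidence:.2f}")  # confidence is optional
    return " ".join(fields)


def write_ctm(path, words):
    """Write an iterable of (recording_id, start, duration, word) tuples as UTF-8 CTM."""
    with open(path, "w", encoding="utf-8") as f:
        for recording_id, start, duration, word in words:
            f.write(format_ctm_line(recording_id, start, duration, word) + "\n")
```

A call such as `format_ctm_line("utt001", 0.0, 0.42, "xin")` yields the line `utt001 1 0.00 0.42 xin`, where `utt001` is a hypothetical recording ID.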

Output Conventions

Because some input speech can be transcribed in more than one way, the following rules apply to reduce ambiguity:

  1. The text will be scored case-insensitively, but may be submitted with its original casing.

  2. Numbers, dates, etc. must be transcribed in words as they are spoken, not in digits.

  3. Common acronyms such as NATO and EU are written as one word, without any special markers between the letters. This applies whether they are spoken as one word or spelled out as a letter sequence. All other spelled-out letter sequences are written as individual letters with spaces in between.

  4. English words and foreign names of people and places, such as Youtube and Facebook, are written in their original form, not as Vietnamese pronunciation.
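Some of these conventions can be checked automatically before submission. The sketch below flags tokens that contain digits (rule 2) and lowercases text in the way a case-insensitive scorer might (rule 1). Both helper names are hypothetical, and the checks are a convenience only, not part of the official scoring.

```python
def find_digit_tokens(transcript):
    """Return the tokens that contain at least one digit character."""
    return [tok for tok in transcript.split() if any(ch.isdigit() for ch in tok)]


def normalize_case(transcript):
    """Lowercase a transcript, mirroring case-insensitive scoring."""
    return transcript.lower()
```

For instance, `find_digit_tokens("năm 2019 có 3 đội")` returns `["2019", "3"]`, which indicates that these tokens still need to be written out in words.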