Automatic Speech Recognition | Association for Vietnamese Language and Speech Processing

Important dates

July 15, 2019: Registration open
August 1, 2019: Training data released
October 5, 2019: Test set released
October 6, 2019: Test result submission
October 4, 2019: Technical report submission
October 13, 2019: Result announcement (workshop day)

Automatic Speech Recognition (ASR) systems are used in many everyday applications, including AI assistants, voice command control, robots, and call centers. An ASR system transcribes input speech into a sequence of words, using techniques that range from HMM-GMM models to more recent sequence-to-sequence models. A robust ASR system is one that recognizes words accurately under varied input signal conditions.

The performance of an ASR system is measured by the number of errors it makes, which fall into three types:

Deletion: the system fails to recognize a word that is spoken in the input audio.
Substitution: the system misrecognizes a spoken word as a different word.
Insertion: the system outputs a word that is not spoken in the input audio.
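
The three error types are commonly combined into a word error rate (WER), the total number of errors divided by the number of words in the reference transcript. The sketch below counts the errors with a standard edit-distance alignment. It is an illustration only, not the official scoring tool: the function names `count_errors` and `word_error_rate` are hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # table[i][j] = (total, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no new error
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, where N is the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    return sum(count_errors(reference, hypothesis)) / n_ref


if __name__ == "__main__":
    # One reference word ("các") is missing from the hypothesis.
    print(count_errors("xin chào các bạn", "xin chào bạn"))
    print(word_error_rate("xin chào các bạn", "xin chào bạn"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator counts only reference words.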

This year, the organizer provides training data (see below), which participants can use to train their models. Participants are also free to use any other data sources.

Training Data

Number of utterances: 315449
Number of hours: ~415 hours
Transcription: manually labeled

Evaluation Data

Two evaluation sets will be made available: the test2018 set and the test2019 set, which serves as the progressive test set. Submitting results on both sets is mandatory.

Submission Guidelines

ASR Run Submission Format

Multiple run submissions are allowed, but participants must explicitly mark one run as PRIMARY. All other runs are treated as CONTRASTIVE. If none of the runs is marked as PRIMARY, the latest submission (according to the file timestamp) for the respective track will be used as the PRIMARY run.

Runs must be submitted as a gzipped TAR archive (see the format below) and uploaded to an assigned folder. Participants will receive the folder URL after registration.
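
As a rough illustration of the packaging step, a gzipped TAR archive can be built with Python's standard `tarfile` module. The function name `pack_run`, the file names, and the flat layout inside the archive are assumptions made for this sketch; the required archive layout is whatever the organizer specifies.

```python
import tarfile
from pathlib import Path


def pack_run(ctm_paths, archive_path):
    """Bundle the given CTM files into one gzipped TAR archive.

    Files are stored flat under their base names (an assumption of this
    sketch, not a stated requirement of the task).
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in ctm_paths:
            tar.add(path, arcname=Path(path).name)
    return archive_path


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pack_run(["test2018.ctm", "test2019.ctm"], "primary_run.tar.gz")
```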

Submissions must be in CTM format; see the CTM section of the NIST SCTK documentation for details. Confidence values are optional. The channel number must be '1'. Scoring is case-insensitive. Files must be encoded in UTF-8.
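
For orientation, a CTM file holds one recognized word per line, with the fields recording ID, channel, start time, duration, word, and an optional confidence score, separated by whitespace. The sketch below writes such lines under the constraints stated above (channel '1', UTF-8). The helper names `ctm_line` and `write_ctm`, the recording ID `utt001`, and the two-decimal time formatting are assumptions for illustration; the NIST SCTK documentation remains the authoritative reference.

```python
def ctm_line(recording_id, start, duration, word, confidence=None):
    """Format one CTM record; times are in seconds, channel is fixed to '1'."""
    fields = [recording_id, "1", f"{start:.2f}", f"{duration:.2f}", word]
    if confidence is not None:
        fields.append(f"{confidence:.2f}")  # confidence is optional
    return " ".join(fields)


def write_ctm(path, records):
    """Write (recording_id, start, duration, word[, confidence]) tuples as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(ctm_line(*record) + "\n")


if __name__ == "__main__":
    print(ctm_line("utt001", 0.5, 0.32, "xin"))
    print(ctm_line("utt001", 0.82, 0.4, "chào", confidence=0.9))
```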

Output Conventions

Because some input speech can be transcribed in more than one way, the following rules apply to reduce ambiguity:

Text is scored case-insensitively, but may be submitted in any case.
Numbers, dates, etc. must be transcribed in words as they are spoken, not in digits.
Common acronyms, such as NATO and EU, are written as one word, without any special markers between the letters. This applies whether they are spoken as one word or spelled out letter by letter. All other spelled-out letter sequences are written as individual letters separated by spaces.
English words and names of people and places in other languages, such as Youtube and Facebook, are written in their original spelling, not as pronounced in Vietnamese.
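
Some of these conventions can be checked automatically before submission. The sketch below flags tokens that contain digits and acronyms written with markers between the letters (for example `N.A.T.O.`). It is a heuristic only: the function name `convention_issues` and the regular expression are assumptions, and it cannot tell whether a foreign name has been respelled according to its Vietnamese pronunciation.

```python
import re

# Single ASCII letters joined by dots or hyphens, e.g. "N.A.T.O." or "E-U".
MARKED_ACRONYM = re.compile(r"[A-Za-z](?:[.\-][A-Za-z])+\.?")


def convention_issues(text):
    """Return a list of human-readable warnings for one transcript line."""
    issues = []
    for token in text.split():
        if any(ch.isdigit() for ch in token):
            issues.append(f"digits in token: {token}")
        elif MARKED_ACRONYM.fullmatch(token):
            issues.append(f"marked acronym: {token}")
    return issues


if __name__ == "__main__":
    print(convention_issues("nato họp năm 2019"))
    print(convention_issues("N.A.T.O. và EU"))
```

An empty list means that no issue was detected by these two checks, not that the line is guaranteed to follow every convention.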