VLSP 2023 Challenge on Automatic Speech Recognition and Speech Emotion Recognition

Shared Task Registration Form

Important dates
(Timezone: UTC+7)

  • Aug 14, 2023: Registration opens
  • Sept 04, 2023: Registration closes
  • Sept 06, 2023: Dataset building starts
  • Sept 17, 2023: Dataset building ends
  • Sept 20, 2023: Training data and public test data released
  • Nov 17, 2023: Private test set released 
  • Nov 17, 2023: Test result submission
  • Nov 26, 2023: Technical report submission
  • Dec 15-16, 2023: Result announcement - Workshop days

General Description
The challenge focuses on developing a full pipeline of ASR and SER models from scratch under limited training data conditions. There are two emotion categories: “neutral” and “negative”. The organizers will provide four training datasets (released by TLU and NamiTech).

Dataset      Amount (hours)   Text label   Emotion label
Dataset 1    200              No           No
Dataset 2    60               Yes          No
Dataset 3    5                No           Yes
Dataset 4    40               No           Yes (low quality)

Notes:

  • All participants are required to label (text and emotion) a small amount of unlabeled data provided by the organizers.
  • Participants may use only the provided data to develop models. Use of any other resource for model development is not permitted.

Evaluation metrics

For each utterance, participants submit two outputs:

  • Text sequence (ASR output)
  • Emotion label (Emotion output)

The quality of the models will be evaluated by the Syllable Error Rate (SyER_ASR) and the Emotion Recognition Accuracy (ACC_SER).

SyER_ASR = (S + D + I) / N, where

  • S is the number of substitutions,
  • D is the number of deletions,
  • I is the number of insertions,
  • C is the number of correct syllables,
  • N is the number of syllables in the reference (N = S + D + C).

ACC_SER = (NEU_Corr / NEU + NEG_Corr / NEG) / 2, where

  • NEU_Corr is the number of neutral utterances recognized correctly,
  • NEU is the total number of neutral utterances,
  • NEG_Corr is the number of negative utterances recognized correctly,
  • NEG is the total number of negative utterances.

The overall result is calculated as

Score = 0.7 * (1 - SyER_ASR) + 0.3 * ACC_SER
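
The three formulas above can be sketched in Python for local checking. This is a minimal illustration, not the organizers' scoring script. It assumes that syllables are obtained by splitting on whitespace, that errors and reference lengths are pooled over all utterances before dividing, and that both emotion classes occur in the reference. All function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference syllable
                curr[j - 1] + 1,         # insertion of a hypothesis syllable
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def syllable_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """(S + D + I) / N, pooled over all utterances (an assumption)."""
    total_edits = 0
    total_ref_syllables = 0
    for ref, hyp in zip(refs, hyps):
        ref_syllables = ref.split()  # assumption: whitespace-separated syllables
        total_edits += edit_distance(ref_syllables, hyp.split())
        total_ref_syllables += len(ref_syllables)
    return total_edits / total_ref_syllables


def emotion_accuracy(ref_labels: Sequence[str], hyp_labels: Sequence[str]) -> float:
    """Mean of the per-class accuracies for 'neutral' and 'negative'."""
    per_class = []
    for label in ("neutral", "negative"):
        total = sum(1 for r in ref_labels if r == label)
        correct = sum(
            1 for r, h in zip(ref_labels, hyp_labels) if r == label and h == label
        )
        per_class.append(correct / total)  # assumes each class is present
    return sum(per_class) / 2


def overall_score(syllable_error: float, accuracy: float) -> float:
    """Weighted combination: 0.7 for ASR quality, 0.3 for emotion accuracy."""
    return 0.7 * (1.0 - syllable_error) + 0.3 * accuracy
```

Because the emotion accuracy averages the two per-class rates, a system that always predicts one class scores 0.5 regardless of how the classes are distributed in the test set.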

Submission Guidelines

Submission Format

Submissions must be UTF-8 encoded and lowercase, with one line per utterance in the following format:

utterance_name<TAB>emotion_label<TAB>recognized_text_sequence

For example,

0001.wav          neutral               chào mừng các bạn đã tham dự cuộc thi
0002.wav         negative             tôi sẽ kiện ra toà
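
A submission file in this layout could be produced with a short Python helper. This is a sketch under assumptions: the function names are hypothetical, only the recognized text is lowercased (the utterance name is written as given), and runs of whitespace in the text are collapsed to single spaces so that no stray tab ends up inside a field.

```python
from typing import Iterable, Tuple

VALID_LABELS = ("neutral", "negative")


def format_line(utterance_name: str, emotion_label: str, text: str) -> str:
    """Build one tab-separated submission line."""
    if emotion_label not in VALID_LABELS:
        raise ValueError(f"unknown emotion label: {emotion_label!r}")
    # Lowercase the text and collapse whitespace (assumption, see lead-in).
    normalized = " ".join(text.lower().split())
    return f"{utterance_name}\t{emotion_label}\t{normalized}"


def write_submission(path: str, rows: Iterable[Tuple[str, str, str]]) -> None:
    """Write (utterance_name, emotion_label, text) rows, one per line, as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name, label, text in rows:
            handle.write(format_line(name, label, text) + "\n")
```

For example, `write_submission("results.txt", [("0001.wav", "neutral", "chào mừng các bạn")])` writes a single line with the three fields separated by tab characters; the file name `results.txt` is only a placeholder.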

Output Conventions

Because the same input speech can sometimes be transcribed in more than one way, the following rules apply to reduce ambiguity:

  • Numbers, dates, etc. must be transcribed as words, as they are spoken, not as digits.
  • Common acronyms such as nato and fifa are written as one word, without any special markers between the letters, regardless of whether they are spoken as one word or spelled out letter by letter. All other spelled-out letter sequences are written as individual letters separated by spaces.
  • English words and foreign names of people and places, such as youtube and facebook, are written in their original spelling, not transcribed according to Vietnamese pronunciation.
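
Some of these conventions can be checked mechanically before submitting. The Python sketch below flags digits, uppercase characters, and single letters joined by dots or hyphens (for example "n.a.t.o"). It is a hypothetical helper, not an official validator, and it cannot tell whether a word is a common acronym or whether a foreign name is spelled correctly.

```python
import re
from typing import List

# Two or more single letters, each followed by "." or "-", e.g. "n.a." in "n.a.t.o".
_ACRONYM_MARKERS = re.compile(r"\b(?:[^\W\d_][.\-]){2,}")


def check_transcript(text: str) -> List[str]:
    """Return codes for convention violations found in one transcript."""
    issues = []
    if any(ch.isdigit() for ch in text):
        issues.append("digits")           # numbers must be written as words
    if text != text.lower():
        issues.append("uppercase")        # submissions must be lowercase
    if _ACRONYM_MARKERS.search(text):
        issues.append("acronym_markers")  # no markers between acronym letters
    return issues
```

An empty list means only that none of these three checks fired; the transcript still needs a manual review against the rules above.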
     

Organizers

  • Đỗ Văn Hải - TLU - haidv@tlu.edu.vn
  • Cao Mạnh Hải - NamiTech - manhhai.cao@namitech.io
  • Lê Quang Trung - GoGa - trungle@goga.ai