
Association for Vietnamese Language and Speech Processing

A chapter of VAIP - Vietnam Association for Information Processing

VLSP 2025 Automatic Speech Recognition and Speech Emotion Recognition

Important dates

June 23, 2025: Registration open

July 1, 2025: Training data, public test release

August 14, 2025: Private test release

August 14, 2025: System submission deadline

August 14, 2025: Private test results release

August 30, 2025: Technical report submission

September 27, 2025: Notification of acceptance

October 3, 2025: Camera-ready deadline

October 29-30, 2025: Conference dates

 

General Description

The challenge centers on developing models for Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER), focusing on two emotion categories: "neutral" and "negative." Participants are encouraged to create a single model capable of recognizing both text and emotion labels.

This year, in addition to the data provided by the organizers, teams can utilize external resources like pretrained models and open datasets. Before the competition starts, teams are encouraged to propose external resources. The organizers will review these suggestions and select resources based on criteria such as accuracy, popularity, and size to ensure fairness. During the competition, teams are only permitted to use the resources approved by the organizers.

Evaluation Metric

For each utterance, participants submit two outputs:

  • Text sequence (ASR output)
  • Emotion label (Emotion output)

Model quality will be evaluated with two metrics: the Syllable Error Rate (SyER_ASR) and the Emotion Recognition Accuracy (ACC_SER).

SyER_ASR = (S + D + I) / N

where

  • S is the number of substitutions,
  • D is the number of deletions,
  • I is the number of insertions,
  • C is the number of correct syllables,
  • N is the number of syllables in the reference (N=S+D+C)
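The Syllable Error Rate definition above can be sketched in Python as follows. This is a minimal illustration, not the organizers' scoring script. It assumes that syllables are separated by whitespace, that S, D and I come from a minimum-edit-distance (Levenshtein) alignment, and that counts are summed over all utterances before dividing by N. The function names are hypothetical.

```python
def syllable_edit_counts(reference, hypothesis):
    """Return (S, D, I, N) for one utterance, assuming whitespace-separated syllables."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_cost, substitutions, deletions, insertions).
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            cost, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, s, d, ins)          # correct syllable
            else:
                diagonal = (cost + 1, s + 1, d, ins)  # substitution
            cost, s, d, ins = prev[j]
            deletion = (cost + 1, s, d + 1, ins)
            cost, s, d, ins = curr[j - 1]
            insertion = (cost + 1, s, d, ins + 1)
            # Tuples compare by total cost first, so min() picks a cheapest path.
            curr.append(min(diagonal, deletion, insertion))
        prev = curr
    _, s, d, ins = prev[-1]
    return s, d, ins, len(ref)


def corpus_syer(pairs):
    """Syllable error rate over (reference, hypothesis) pairs, pooling counts (assumption)."""
    total_errors = 0
    total_syllables = 0
    for reference, hypothesis in pairs:
        s, d, ins, n = syllable_edit_counts(reference, hypothesis)
        total_errors += s + d + ins
        total_syllables += n
    if total_syllables == 0:
        raise ValueError("references contain no syllables")
    return total_errors / total_syllables
```

For instance, scoring the hypothesis "tôi sẽ kiện toà" against the reference "tôi sẽ kiện ra toà" gives one deletion out of five reference syllables, so the rate for that utterance alone is 1/5.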

ACC_SER = (NEUCorr / NEU + NEGCorr / NEG) / 2

where 

  • NEUCorr is the number of neutral utterances labeled correctly,
  • NEU is the total number of neutral utterances,
  • NEGCorr is the number of negative utterances labeled correctly,
  • NEG is the total number of negative utterances.
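The emotion accuracy above is the mean of the per-class recalls, so both classes count equally even if the test set is imbalanced. A minimal Python sketch, assuming the labels are the strings "neutral" and "negative" as in the submission format (the function name is hypothetical):

```python
def emotion_accuracy(reference_labels, predicted_labels):
    """Mean of per-class recall over the 'neutral' and 'negative' classes."""
    totals = {"neutral": 0, "negative": 0}
    correct = {"neutral": 0, "negative": 0}
    for ref, pred in zip(reference_labels, predicted_labels):
        if ref not in totals:
            raise ValueError(f"unexpected reference label: {ref!r}")
        totals[ref] += 1
        if pred == ref:
            correct[ref] += 1
    if totals["neutral"] == 0 or totals["negative"] == 0:
        raise ValueError("both classes must appear in the reference labels")
    neutral_recall = correct["neutral"] / totals["neutral"]
    negative_recall = correct["negative"] / totals["negative"]
    return (neutral_recall + negative_recall) / 2
```

With three neutral references of which two are predicted correctly and one negative reference predicted correctly, the result is (2/3 + 1/1) / 2, roughly 0.83, whereas plain accuracy would be 3/4.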

The overall result is calculated as

Score = 0.7 * (1 - SyER_ASR) + 0.3 * ACC_SER
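As a worked example of the weighting, the short Python sketch below combines the two metrics. The input values are purely illustrative and are not results from any system; the function name is hypothetical.

```python
def overall_score(syllable_error_rate, emotion_accuracy):
    """Weighted combination: 70% ASR quality (1 - error rate), 30% emotion accuracy."""
    return 0.7 * (1.0 - syllable_error_rate) + 0.3 * emotion_accuracy


# Illustrative inputs: 10% syllable error rate, 80% emotion accuracy.
# 0.7 * 0.90 + 0.3 * 0.80 = 0.63 + 0.24 = 0.87
print(round(overall_score(0.10, 0.80), 4))
```

Because insertions count as errors, the syllable error rate can exceed 1, in which case the ASR term becomes negative.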

Submission Guidelines

Submission Format

Submissions must be UTF-8 encoded and lower-cased, with one line per utterance in the following format:

utterance_name<TAB>emotion_label<TAB>recognized_text_sequence

For example

0001.wav	neutral	chào mừng các bạn đã tham dự cuộc thi
0002.wav	negative	tôi sẽ kiện ra toà
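One way to produce lines in this format is sketched below in Python. It is an illustration under stated assumptions rather than an official tool: the helper names are hypothetical, the label set is taken from the two categories named in the task description, and collapsing repeated whitespace in the transcript is an extra precaution that the guidelines do not require.

```python
VALID_LABELS = ("neutral", "negative")


def format_submission_line(utterance_name, emotion_label, text):
    """Build one tab-separated submission line with a lower-cased transcript."""
    if emotion_label not in VALID_LABELS:
        raise ValueError(f"unknown emotion label: {emotion_label!r}")
    # Lower-case the transcript and collapse runs of whitespace (assumption).
    normalized = " ".join(text.lower().split())
    return "\t".join((utterance_name, emotion_label, normalized))


def write_submission(rows, path):
    """Write (utterance_name, emotion_label, text) rows as UTF-8, one per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for utterance_name, emotion_label, text in rows:
            handle.write(format_submission_line(utterance_name, emotion_label, text) + "\n")
```

Calling `format_submission_line("0002.wav", "negative", "Tôi sẽ kiện ra toà ")` yields the three fields joined by single tab characters, with the transcript lower-cased and trimmed.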

Output Conventions

Because some speech can be transcribed in more than one way, the following rules apply to keep outputs consistent:

  1. Numbers, dates, etc. must be transcribed in words as they are spoken, not in digits.
  2. Common acronyms such as nato and fifa are written as one word, without any special markers between the letters. This applies whether they are spoken as one word or spelled out as a letter sequence. All other spelled-out letter sequences are written as individual letters separated by spaces.
  3. English words and foreign names of people and places, such as youtube and facebook, are written in their original spelling, not in Vietnamese pronunciation.
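Some violations of these conventions can be caught mechanically before submitting. The Python sketch below is a hypothetical pre-submission check, not part of the official evaluation: it only flags digits, upper-case letters and stray tab or newline characters in a transcript. The acronym and foreign-word rules cannot be verified this way without a word list, so they are not covered.

```python
def find_convention_issues(text):
    """Return a list of mechanical problems found in one recognized transcript."""
    issues = []
    if any(ch.isdigit() for ch in text):
        # Numbers and dates should be written out in words.
        issues.append("contains digits")
    if text != text.lower():
        # Submissions are expected to be lower-case.
        issues.append("contains upper-case letters")
    if "\t" in text or "\n" in text:
        # Tabs separate fields and newlines separate utterances.
        issues.append("contains tab or newline")
    return issues
```

An empty list means that none of these three checks fired; it does not guarantee that the transcript follows every rule above.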

Contact

Zalo Group: https://zalo.me/g/zjqswf140

Registration

https://forms.gle/WaCc9peg1ZhZ67hf6

Organizers

  • Cao Mạnh Hải - NamiTech - manhhai.cao@namitech.io
  • Hoàng Chí Dũng - NamiTech - dung.hoangchi@namitech.io
  • Lê Quang Trung - Torilab - trungle.bka@gmail.com
  • Đỗ Văn Hải - Thuyloi University - haidv@tlu.edu.vn

 

Sponsors and Partners

VinBIGDATA, VinIF, AIMESOFT, bee, Dagoras

zalo, VTCC, VCCorp

IOIT, HUS, USTH, UET, TLU, UIT, INT2, jaist, VIETLEX