
Association for Vietnamese Language and Speech Processing

A chapter of VAIP - Vietnam Association for Information Processing

VLSP 2025 Challenge on Vietnamese Voice Conversion

Updates

July 3, 2025: The dataset has been shared via Google Drive. If you have not received the email, please check your spam or junk folder. If you still cannot find it, feel free to contact us directly for assistance.

Important dates 

July 1, 2025: Training data, public test release

July 10, 2025: All teams must report any external pretrained models or datasets they intend to use to the organizers and share them with all other teams.

August 14, 2025: Private test release

August 14, 2025: System submission deadline

August 14, 2025: Private test results release

August 30, 2025: Technical report submission

September 27, 2025: Notification of acceptance

October 3, 2025: Camera-ready deadline

October 29-30, 2025: Conference dates

Registration

https://docs.google.com/forms/d/e/1FAIpQLSfiLdGcxMGO0f9JKN7u9PAwFYhnfCexzLO8bSsgEhWX48jnKQ/viewform

Task Description

Voice conversion (VC) is a rapidly evolving field at the intersection of speech processing and artificial intelligence, aiming to transform a speaker's voice into another's while preserving the original linguistic content. As the demand for personalized voice applications grows, the development of robust VC systems, especially for less-resourced languages like Vietnamese, becomes increasingly important.

The Vietnamese Voice Conversion Challenge 2025 focuses on building models that perform high-quality voice conversion in Vietnamese. Participants will work with a dedicated Vietnamese dataset and are tasked with developing systems that can accurately convert voices while maintaining the naturalness and intelligibility of speech.

This year’s VLSP Challenge features two tasks:

  • Task 1: Teams may use pretrained models and external datasets. Any pretrained model used must be publicly available, that is, accessible to anyone without requiring special access or permission. Teams must inform the organizers in advance of every pretrained model they plan to use, including the specific purpose it will serve, so the organizers can verify its eligibility, and must share the approved models with the organizers and all other teams.
  • Task 2: Teams are not allowed to use external datasets or any pretrained models for voice conversion or for tasks related to Vietnamese text.

Important rules for both tasks:

  • All pretrained models (if used) must be approved by the organizers and shared with other teams before usage.
  • Teams are required to share their source code, training scripts, inference code, and clear instructions for training and testing the models to ensure reproducibility. The organizers reserve the right to reproduce results in case of any concerns or doubts.

Evaluation

Three main criteria will be used to evaluate the submitted models:

  • Speaker Similarity Score (SMOS): Measures how similar the converted speech is to the reference speech, based on human perceptual ratings. Scale: 0 to 100.
  • Word Error Rate (WER): Evaluates content accuracy by comparing the converted speech with the source speech using a pretrained ASR model (a minimal computation sketch follows this list). Scale: 0 to 100.
  • Mean Opinion Score (MOS): Measures the naturalness and quality of the converted speech, rated by human listeners. Scale: 0 to 100.
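
For reference, WER compares the source text with an ASR transcript of the converted audio. The sketch below is a minimal illustration in Python using the jiwer package as a stand-in scorer and hypothetical transcripts; the official evaluation derives transcripts with ChunkFormer [4].

    # Minimal WER sketch on the challenge's 0-100 scale (assumes: pip install jiwer).
    # Transcripts are hypothetical placeholders; the official pipeline transcribes
    # the converted audio with ChunkFormer [4].
    import jiwer

    source_transcript = "xin chào các bạn"      # text spoken in the source audio
    converted_transcript = "xin chào các bạn"   # ASR output for the converted audio

    wer_percent = 100.0 * jiwer.wer(source_transcript, converted_transcript)
    print(f"WER = {wer_percent:.1f}")           # 0.0 when the transcripts match exactly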

 

Final Scoring Formula:

Score = 0.4 × [SMOS(ref, out) - SMOS(src, out)] + 0.3 × MOS + 0.3 × (100 - WER)

Where:

  • SMOS(ref, out): Speaker similarity between reference audio and output converted audio.
  • SMOS(src, out): Speaker similarity between source audio and output converted audio.
  • MOS: Naturalness rating.
  • WER: Word Error Rate (calculated using ChunkFormer [4]).
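
As a quick numeric illustration, the sketch below computes the formula with hypothetical values, assuming all inputs are already on the challenge's 0-100 scales:

    # Minimal sketch of the final scoring formula (hypothetical inputs).
    def final_score(smos_ref_out, smos_src_out, mos, wer):
        # Score = 0.4*[SMOS(ref,out) - SMOS(src,out)] + 0.3*MOS + 0.3*(100 - WER)
        return 0.4 * (smos_ref_out - smos_src_out) + 0.3 * mos + 0.3 * (100 - wer)

    # Example: SMOS(ref,out)=85, SMOS(src,out)=20, MOS=80, WER=10
    # -> 0.4*65 + 0.3*80 + 0.3*90 = 26 + 24 + 27 = 77.0
    print(final_score(smos_ref_out=85, smos_src_out=20, mos=80, wer=10))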

Training and Test Data

Participants will receive two multi-speaker Vietnamese speech datasets: one with corresponding text transcripts and another consisting of audio only. The challenge emphasizes the generalization ability of models, as the test set will feature speakers not present in the training data, requiring models to adapt effectively to unseen voices.

Submission

Participants will receive a sample format from the organizers and must follow it when naming their converted audio outputs. After conversion, participants will submit only the audio files to the organizers.

Example format:

source_audio<TAB>target_audio<TAB>converted_audio_filename

Replace converted_audio_filename with the name of your generated audio file, following the same pattern as in the provided example.
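
For illustration only, with hypothetical file names (always follow the exact sample format provided by the organizers), a filled-in line could look like:

source_0001.wav<TAB>target_0007.wav<TAB>converted_0001_0007.wav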

Organizers

References
  1. J. Li, W. Tu and L. Xiao, "FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095191.
  2. S. Shan, Y. Li, A. Banerjee and J. Oliva, "Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, pp. 14910-14918, 2024, doi: 10.1609/aaai.v38i13.29411.
  3. Huu Tuong Tu, Luong Thanh Long, Vu Huan, Nguyen Thi Phuong Thao, Nguyen Van Thang, Nguyen Tien Cuong and Nguyen Thi Thu Trang, "Voice Conversion for Low-Resource Languages via Knowledge Transfer and Domain-Adversarial Training," ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10889083.
  4. K. Le, T. V. Ho, D. Tran and D. T. Chau, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription," ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10888640.

Sponsors and Partners

VinBIGDATA, VinIF, AIMESOFT, bee, Dagoras, zalo, VTCC, VCCorp, IOIT, HUS, USTH, UET, TLU, UIT, INT2, jaist, VIETLEX