VLSP 2025 Challenge on Vietnamese Voice Conversion
Updates
July 3, 2025: We would like to inform you that the dataset has been shared via Google Drive. If you have not received the email, please check your spam or junk folder. If you still cannot find it, feel free to contact us directly for assistance.
Important dates
July 1, 2025: Training data, public test release
July 10, 2025: All teams must report to the organizers any external pretrained models or datasets they plan to use and share them with all other teams.
August 14, 2025: Private test release
August 14, 2025: System submission deadline
August 14, 2025: Private test results release
August 30, 2025: Technical report submission
September 27, 2025: Notification of acceptance
October 3, 2025: Camera-ready deadline
October 29-30, 2025: Conference dates
Registration
https://docs.google.com/forms/d/e/1FAIpQLSfiLdGcxMGO0f9JKN7u9PAwFYhnfCexzLO8bSsgEhWX48jnKQ/viewform
Task Description
Voice conversion (VC) is a rapidly evolving field at the intersection of speech processing and artificial intelligence, aiming to transform a speaker's voice into another's while preserving the original linguistic content. As the demand for personalized voice applications grows, the development of robust VC systems, especially for less-resourced languages like Vietnamese, becomes increasingly important.
The Vietnamese Voice Conversion Challenge 2025 focuses on building models that perform high-quality voice conversion in Vietnamese. Participants will work with a dedicated Vietnamese dataset and are tasked with developing systems that can accurately convert voices while maintaining the naturalness and intelligibility of speech.
This year’s VLSP Challenge features two tasks:
- Task 1: Teams may use pretrained models and external datasets. Any pretrained model used must be publicly available, that is, accessible to anyone without requiring special access or permission. Teams must inform the organizers in advance of every pretrained model they plan to use, including the specific purpose for which it will be used, so that the organizers can verify its eligibility, and must share these models with the organizers and all other teams.
- Task 2: Teams are not allowed to use external datasets or any pretrained models for voice conversion or for tasks related to Vietnamese text.
Important rules for both tasks:
- All pretrained models (if used) must be approved by the organizers and shared with other teams before usage.
- Teams are required to share their source code, training scripts, inference code, and clear instructions for training and testing the models to ensure reproducibility. The organizers reserve the right to reproduce results in case of any concerns or doubts.
Evaluation
Three main criteria will be used to evaluate the submitted models:
- Speaker Similarity Score (SMOS): Measures how similar the converted speech is to the reference speech, based on human perceptual ratings. Scale: 0 to 100.
- Word Error Rate (WER): Measures content accuracy by comparing the converted speech against the source speech using a pretrained ASR model (see the sketch after this list). Scale: 0 to 100.
- Mean Opinion Score (MOS): Measures the naturalness and quality of the converted speech, rated by human listeners. Scale: 0 to 100.
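To illustrate how a WER value on this scale can be obtained from ASR transcripts, here is a minimal sketch using the jiwer library. The library choice and the example transcripts are assumptions for illustration only; the official scoring uses ChunkFormer [4] as the ASR model.

```python
# Illustrative WER computation (not the official pipeline, which uses ChunkFormer [4]).
import jiwer

# Hypothetical transcripts: reference = ASR output of the source audio,
# hypothesis = ASR output of the converted audio.
reference = "xin chào các bạn đến với hội thảo"
hypothesis = "xin chào các bạn đến hội thảo"  # one word missing

# jiwer.wer returns a fraction in [0, 1]; scale to 0-100 to match the challenge metric.
wer_percent = 100.0 * jiwer.wer(reference, hypothesis)
print(f"WER: {wer_percent:.1f}")  # 1 deletion out of 8 reference words -> 12.5
```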
Final Scoring Formula (a worked example is given after the definitions below):
Score = 0.4 × [SMOS(ref, out) − SMOS(src, out)] + 0.3 × MOS + 0.3 × (100 − WER)
Where:
- SMOS(ref, out): Speaker similarity between the reference audio and the converted output audio.
- SMOS(src, out): Speaker similarity between the source audio and the converted output audio.
- MOS: Naturalness rating.
- WER: Word Error Rate, computed with ChunkFormer [4].
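As a quick arithmetic check, the sketch below evaluates the formula for made-up metric values; the numbers are purely illustrative, not real results.

```python
# Worked example of the final scoring formula with hypothetical values (illustration only).
def challenge_score(smos_ref_out, smos_src_out, mos, wer):
    """Score = 0.4*[SMOS(ref,out) - SMOS(src,out)] + 0.3*MOS + 0.3*(100 - WER)."""
    return 0.4 * (smos_ref_out - smos_src_out) + 0.3 * mos + 0.3 * (100 - wer)

# Made-up numbers on the 0-100 scales described above.
score = challenge_score(smos_ref_out=85, smos_src_out=30, mos=80, wer=10)
print(score)  # 0.4*55 + 0.3*80 + 0.3*90 = 22 + 24 + 27 = 73.0
```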
Training and Test Data
Participants will receive two multi-speaker Vietnamese speech datasets: one with corresponding text labels and one consisting of audio only. The challenge emphasizes the generalization ability of models, as the test set will feature speakers not present in the training data. This requires models to adapt effectively to unseen voices.
Submission
Participants will receive a sample format from the organizers and must follow it when naming their converted audio outputs. After conversion, participants will submit only the audio files to the organizers.
Example format:
source_audio<TAB>target_audio<TAB>converted_audio_filename
Replace converted_audio_filename with the name of your generated audio file, following the same pattern as in the provided example.
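As a rough illustration of producing outputs that follow this naming pattern, here is a minimal sketch. The manifest file name ("test_manifest.tsv"), the column order, and the run_conversion function are assumptions for illustration; the authoritative format is the sample distributed by the organizers.

```python
# Minimal sketch: name converted outputs according to a tab-separated manifest.
# Assumptions (not from the official package): the manifest has three tab-separated
# columns (source_audio, target_audio, converted_audio_filename), and run_conversion()
# is a placeholder for your own voice-conversion inference code.
import csv
from pathlib import Path

def run_conversion(source_path: str, target_path: str) -> bytes:
    """Placeholder: return the converted waveform encoded as WAV bytes."""
    raise NotImplementedError

out_dir = Path("submission")
out_dir.mkdir(exist_ok=True)

with open("test_manifest.tsv", newline="", encoding="utf-8") as f:
    for source_audio, target_audio, converted_name in csv.reader(f, delimiter="\t"):
        wav_bytes = run_conversion(source_audio, target_audio)
        # Save the converted audio under exactly the file name given in the manifest.
        (out_dir / converted_name).write_bytes(wav_bytes)
```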
Organizers
- Nguyễn Thị Thu Trang, Hanoi University of Science and Technology, trangntt@soict.hust.edu.vn
- Hữu Tường Tú, Hanoi University of Science and Technology, VNPT AI, huutu12312vn@gmail.com
- Lê Hoàng Anh Tuấn, Hanoi University of Science and Technology, tuanbkak66@gmail.com
References
[1] J. Li, W. Tu and L. Xiao, "FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion," ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095191.
[2] S. Shan, Y. Li, A. Banerjee and J. Oliva, "Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, pp. 14910-14918, 2024, doi: 10.1609/aaai.v38i13.29411.
[3] Huu Tuong Tu, Luong Thanh Long, Vu Huan, Nguyen Thi Phuong Thao, Nguyen Van Thang, Nguyen Tien Cuong and Nguyen Thi Thu Trang, "Voice Conversion for Low-Resource Languages via Knowledge Transfer and Domain-Adversarial Training," ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10889083.
[4] K. Le, T. V. Ho, D. Tran and D. T. Chau, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription," ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10888640.
Sponsors and Partners