VLSP 2025 Challenge on Vietnamese Voice Conversion
Updates
July 3, 2025: We would like to inform you that the dataset has been shared via Google Drive. If you have not received the email, please check your spam or junk folder. If you still cannot find it, feel free to contact us directly for assistance.
Important dates
July 1, 2025: Training data, public test release
July 10, 2025: All teams must report to the organizers any external pretrained models or datasets they plan to use and share them with all other teams.
August 14, 2025: Private test release
August 14, 2025: System submission deadline
August 14, 2025: Private test results release
August 30, 2025: Technical report submission
September 27, 2025: Notification of acceptance
October 3, 2025: Camera-ready deadline
October 29-30, 2025: Conference dates
Registration
https://docs.google.com/forms/d/e/1FAIpQLSfiLdGcxMGO0f9JKN7u9PAwFYhnfCexzLO8bSsgEhWX48jnKQ/viewform
Task Description
Voice conversion (VC) is a rapidly evolving field at the intersection of speech processing and artificial intelligence, aiming to transform a speaker's voice into another's while preserving the original linguistic content. As the demand for personalized voice applications grows, the development of robust VC systems, especially for less-resourced languages like Vietnamese, becomes increasingly important.
The Vietnamese Voice Conversion Challenge 2025 focuses on building models that perform high-quality voice conversion in Vietnamese. Participants will work with a dedicated Vietnamese dataset and are tasked with developing systems that can accurately convert voices while maintaining the naturalness and intelligibility of speech.
This year’s VLSP Challenge features two tasks:
- Task 1: Teams may use pretrained models and external datasets. Any pretrained model used must be publicly available, that is, accessible to anyone without requiring special access or permission. Teams must inform the organizers in advance of every pretrained model they plan to use, including the specific purpose for which it will be used, so that the organizers can verify its eligibility, and must share these models with the organizers and all other teams.
- Task 2: Teams are not allowed to use external datasets or any pretrained models for voice conversion or for tasks related to Vietnamese text.
Important rules for both tasks:
- All pretrained models (if used) must be approved by the organizers and shared with other teams before usage.
- Teams are required to share their source code, training scripts, inference code, and clear instructions for training and testing the models to ensure reproducibility. The organizers reserve the right to reproduce results in case of any concerns or doubts.
Evaluation
Three main criteria will be used to evaluate the submitted models:
- Speaker Similarity Score (SMOS): Measures how similar the converted speech is to the reference speech, based on human perceptual ratings. Scale: 0 to 100.
- Word Error Rate (WER): Measures content accuracy by comparing the converted speech against the source speech using a pretrained ASR model (see the sketch after this list). Scale: 0 to 100.
- Mean Opinion Score (MOS): Measures the naturalness and quality of the converted speech, rated by human listeners. Scale: 0 to 100.
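To illustrate how a WER value on this scale can be obtained from ASR transcripts, here is a minimal sketch using the jiwer library. The library choice and the example transcripts are assumptions for illustration only; the official scoring uses ChunkFormer [4] as the ASR model.

```python
# Illustrative WER computation (not the official pipeline, which uses ChunkFormer [4]).
import jiwer

# Hypothetical transcripts: reference = ASR output of the source audio,
# hypothesis = ASR output of the converted audio.
reference = "xin chào các bạn đến với hội thảo"
hypothesis = "xin chào các bạn đến hội thảo"  # one word missing

# jiwer.wer returns a fraction in [0, 1]; scale to 0-100 to match the challenge metric.
wer_percent = 100.0 * jiwer.wer(reference, hypothesis)
print(f"WER: {wer_percent:.1f}")  # 1 deletion out of 8 reference words -> 12.5
```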
Final Scoring Formula (a worked example is given after the definitions below):
Score = 0.4 × [SMOS(ref, out) − SMOS(src, out)] + 0.3 × MOS + 0.3 × (100 − WER)
Where:
- SMOS(ref, out): Speaker similarity between the reference audio and the converted output audio.
- SMOS(src, out): Speaker similarity between the source audio and the converted output audio.
- MOS: Naturalness rating.
- WER: Word Error Rate, computed with ChunkFormer [4].
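As a quick arithmetic check, the sketch below evaluates the formula for made-up metric values; the numbers are purely illustrative, not real results.

```python
# Worked example of the final scoring formula with hypothetical values (illustration only).
def challenge_score(smos_ref_out, smos_src_out, mos, wer):
    """Score = 0.4*[SMOS(ref,out) - SMOS(src,out)] + 0.3*MOS + 0.3*(100 - WER)."""
    return 0.4 * (smos_ref_out - smos_src_out) + 0.3 * mos + 0.3 * (100 - wer)

# Made-up numbers on the 0-100 scales described above.
score = challenge_score(smos_ref_out=85, smos_src_out=30, mos=80, wer=10)
print(score)  # 0.4*55 + 0.3*80 + 0.3*90 = 22 + 24 + 27 = 73.0
```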
Training and Test Data
Participants will receive two multi-speaker Vietnamese speech datasets: one with corresponding text labels and one consisting of audio only. The challenge emphasizes the generalization ability of models, as the test set will feature speakers not present in the training data. This requires models to adapt effectively to unseen voices.
Submission
Participants will receive a sample format from the organizers and must follow it when naming their converted audio outputs. After conversion, participants will submit only the audio files to the organizers.
Example format:
source_audio<TAB>target_audio<TAB>converted_audio_filename
Replace converted_audio_filename with the name of your generated audio file, following the same pattern as in the provided example.
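As a rough illustration of producing outputs that follow this naming pattern, here is a minimal sketch. The manifest file name ("test_manifest.tsv"), the column order, and the run_conversion function are assumptions for illustration; the authoritative format is the sample distributed by the organizers.

```python
# Minimal sketch: name converted outputs according to a tab-separated manifest.
# Assumptions (not from the official package): the manifest has three tab-separated
# columns (source_audio, target_audio, converted_audio_filename), and run_conversion()
# is a placeholder for your own voice-conversion inference code.
import csv
from pathlib import Path

def run_conversion(source_path: str, target_path: str) -> bytes:
    """Placeholder: return the converted waveform encoded as WAV bytes."""
    raise NotImplementedError

out_dir = Path("submission")
out_dir.mkdir(exist_ok=True)

with open("test_manifest.tsv", newline="", encoding="utf-8") as f:
    for source_audio, target_audio, converted_name in csv.reader(f, delimiter="\t"):
        wav_bytes = run_conversion(source_audio, target_audio)
        # Save the converted audio under exactly the file name given in the manifest.
        (out_dir / converted_name).write_bytes(wav_bytes)
```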
Organizers
- Nguyễn Thị Thu Trang, Hanoi University of Science and Technology, trangntt@soict.hust.edu.vn
- Hữu Tường Tú, Hanoi University of Science and Technology, VNPT AI, huutu12312vn@gmail.com
- Lê Hoàng Anh Tuấn, Hanoi University of Science and Technology, tuanbkak66@gmail.com
References
[1] J. Li, W. Tu and L. Xiao, "FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion," ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10095191.
[2] S. Shan, Y. Li, A. Banerjee and J. Oliva, "Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, pp. 14910-14918, 2024, doi: 10.1609/aaai.v38i13.29411.
[3] Huu Tuong Tu, Luong Thanh Long, Vu Huan, Nguyen Thi Phuong Thao, Nguyen Van Thang, Nguyen Tien Cuong and Nguyen Thi Thu Trang, "Voice Conversion for Low-Resource Languages via Knowledge Transfer and Domain-Adversarial Training," ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10889083.
[4] K. Le, T. V. Ho, D. Tran and D. T. Chau, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription," ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10888640.
Sponsors and Partners