VLSP 2021 - Vietnamese Text-To-Speech

Evaluating spontaneous Vietnamese Text-To-Speech (TTS) on common datasets

The VLSP Speech Synthesis Challenge 2021 is designed to understand and compare research techniques for building Vietnamese spontaneous speech synthesizers on the same data.

The basic challenge is to take the released speech database and build a TTS system trained on the voice it contains. The synthetic utterances that each synthesizer produces for the test sentences will then be evaluated through listening tests.

Participants must help build the dataset before receiving it. The main task is to transcribe, or to correct the existing transcription of, a part of the dataset.

Important dates

  • Aug 5, 2021: Registration opens

  • Aug 30, 2021: Registration closes

  • Sept 1, 2021: Dataset building starts

  • Sept 15, 2021: Dataset building ends

  • Oct 6, 2021 (rescheduled from Oct 1, 2021): Training dataset available

  • Oct 30, 2021 (rescheduled from Oct 18, 2021): TTS API submission, Evaluation phase starts

  • Nov 27, 2021 (rescheduled from Nov 12, 2021): Evaluation phase ends, Individual result announcement

  • Dec 6, 2021 (rescheduled from Nov 20, 2021): Technical report submission

  • Dec 18, 2021 (rescheduled from Nov 26, 2021): Result announcement (workshop day)

Training datasets

You may use only the training dataset provided by the organizers to build the TTS model. The dataset contains 7-8 hours of speech from a single speaker, recorded in a spontaneous setting. You will receive it after taking part in the dataset building. Any use of additional data for model training is prohibited.

You may use public pre-trained models in this task, but you must declare them to the organizers before receiving the dataset. These pre-trained models will be shared with the other teams.


The requirements for the TTS API can be downloaded at https://tts.vlsp.org.vn/TTS_API_spec.pdf. You must submit a TTS API (a public API link) so that the organizers can use it to synthesize utterances from the text in the test set. If you do not have a server with a static IP to host a public API, you can publish your system as a public image on Docker Hub and submit its link.


There are more than 100 sentences in the test set. You will receive your synthetic utterances and your results when the evaluation phase ends. The synthesized utterances will be presented to three groups of listeners: speech experts, volunteers, and undergraduates. Perceptual tests will be used to evaluate the intelligibility and naturalness of the synthetic voices.
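
The scoring scheme of the perceptual tests is not described here. As a rough illustration only, the sketch below assumes that listeners rate each utterance on a 1-5 opinion scale (a common choice for naturalness) and averages the ratings per system and per listener group, with a normal-approximation 95% interval. The system names, the scores, and the `summarize` helper are all hypothetical; only the three listener groups come from the description above.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings: (system_id, listener_group, score on an assumed 1-5 scale).
# The systems and scores are invented for illustration.
RATINGS = [
    ("team_a", "experts", 4), ("team_a", "experts", 5),
    ("team_a", "volunteers", 4), ("team_a", "volunteers", 3),
    ("team_a", "undergraduates", 4), ("team_a", "undergraduates", 4),
    ("team_b", "experts", 3), ("team_b", "experts", 3),
    ("team_b", "volunteers", 2), ("team_b", "volunteers", 4),
    ("team_b", "undergraduates", 3), ("team_b", "undergraduates", 5),
]


def summarize(ratings):
    """Return {(system, group): (mean_score, ci_half_width, n_ratings)}."""
    buckets = defaultdict(list)
    for system, group, score in ratings:
        buckets[(system, group)].append(score)

    summary = {}
    for key, scores in buckets.items():
        n = len(scores)
        # Normal-approximation 95% interval; undefined for a single rating.
        half_width = 1.96 * stdev(scores) / sqrt(n) if n > 1 else float("nan")
        summary[key] = (mean(scores), half_width, n)
    return summary


if __name__ == "__main__":
    for (system, group), (avg, half, n) in sorted(summarize(RATINGS).items()):
        print(f"{system:8s} {group:15s} mean={avg:.2f} +/- {half:.2f} (n={n})")
```

Reporting the groups separately, rather than pooling them, keeps any difference between expert and non-expert listeners visible. With only a handful of ratings per cell the interval is very wide, so a real listening test would need many more ratings per system.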

