Text To Speech: Evaluating Vietnamese speech synthesis on common datasets

Important dates

  • July 15, 2019: Registration open

  • Sep 2, 2019: Training data released

  • Sep 16, 2019: Test set released

  • Sep 17, 2019: Test result submission

  • Oct 4, 2019: Technical report submission

  • Oct 13, 2019: Result announcement (workshop day)

This challenge is designed to help understand and compare research techniques for building corpus-based Vietnamese speech synthesizers on the same data.

The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests.

Training data

Participants will be provided with two training datasets. Each includes utterances and their corresponding texts in a text file:

  • A small training dataset: 1,000 utterances of a single speaker (about 45 minutes)

  • A big training dataset: 15,000 utterances of a single speaker (about 23 hours)

Test data

Participants will be provided with a single test set, to be used for the experiments on both of the training datasets above. The test set is a text file containing 60 utterance texts to be synthesized by the participating systems. The resulting synthesized utterances will then be presented to three groups of listeners: speech experts, volunteers, and undergraduates.

Synthetic utterance submission

Participants are asked to synthesize the utterances from the text in the test set fully automatically. For each listener, only 20 sentences randomly chosen from the 60 submitted will be used in the listening tests, as listening to all 60 would require too many resources (a Latin square design will be used for this test). Our target is to have at least 10 listeners for each sample. Utterances should be waveform files named in the format <small_or_big><team_name>_<utterance_number>.wav. The prefix of each file name is "small" or "big", indicating which training dataset was used to produce the synthetic utterance. Compress these utterances into a <team_name>.zip file and submit it via a web service; details of the service will be sent to participants together with the test set.
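The naming and packaging rules above can be sketched in a few lines of Python. This is only an illustration, not an official submission tool. The function names (submission_name, build_archive) are hypothetical, and several details are assumptions not stated in the call: that utterances are numbered 1 to 60 without zero padding, that files for both training datasets go into the same archive, and that no separator is placed between the "small"/"big" prefix and the team name (the pattern is followed literally as written).

```python
import zipfile
from pathlib import Path


def submission_name(dataset: str, team_name: str, utterance_number: int) -> str:
    """Build a file name following the pattern given in the call."""
    if dataset not in ("small", "big"):
        raise ValueError("dataset must be 'small' or 'big'")
    # Pattern copied literally: <small_or_big><team_name>_<utterance_number>.wav
    # The call writes no separator between prefix and team name, so none is added.
    return f"{dataset}{team_name}_{utterance_number}.wav"


def build_archive(wav_dir, team_name, out_dir,
                  datasets=("small", "big"), n_utterances=60):
    """Check that every expected file exists, then write <team_name>.zip."""
    wav_dir, out_dir = Path(wav_dir), Path(out_dir)
    # Assumption: utterances are numbered 1..n_utterances without padding.
    expected = [
        submission_name(dataset, team_name, number)
        for dataset in datasets
        for number in range(1, n_utterances + 1)
    ]
    missing = [name for name in expected if not (wav_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} expected file(s) missing, e.g. {missing[0]}"
        )
    archive = out_dir / f"{team_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in expected:
            # Store files at the top level of the archive (an assumption).
            zf.write(wav_dir / name, arcname=name)
    return archive
```

For example, build_archive("wavs", "myteam", ".") would look for files such as smallmyteam_1.wav and bigmyteam_60.wav in the wavs directory and write myteam.zip to the current directory. Checking for missing files before zipping catches an incomplete synthesis run before upload.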