Important dates
(Timezone: UTC+7)
- Aug 15, 2023: Registration opens
- Sept 15, 2023: Registration closes
- Oct 3, 2023: Training dataset available
- Oct 30, 2023: TTS API submission
- Nov 26, 2023: Technical report submission
- Dec 10, 2023: Individual result announcement
- Dec 15-16, 2023: Result announcement - Workshop days
Task Description
Emotional speech synthesis (ESS) is the process of producing a realistic and natural-sounding voice that conveys specific emotions based on a given text input. It has emerged as a significant challenge in the field of speech synthesis. Recent advancements have led to the development of various applications, including virtual assistants, call centers, dubbing for movies and games, audiobook narration, and online education.
In the VLSP 2023 TTS challenge, there will be two evaluation tasks for ESS, each focusing on generating emotional speech from a different kind of input. Participating teams may compete in one or both of the tasks.
- Sub-task 1 – ESS with emotion adaptation: In this task, participants are requested to build a TTS system to synthesize the emotional speech of a single speaker, whose training dataset only includes neutral utterances. There are four emotional labels which the TTS system needs to synthesize: neutral, sad, happy, and angry. Participants are also provided with the emotional speech dataset of another speaker to support this sub-task.
- Sub-task 2 – Emotional style transfer in expressive speech synthesis: In this task, participants are requested to build a TTS system to synthesize expressive speech of a single speaker, whose training dataset only includes neutral utterances. The input of the TTS system consists of a text and a reference audio of another speaker. The task of the TTS system is to synthesize audio whose style and emotion are as close as possible to those of the reference audio. Participants are also provided with the emotional speech dataset of another speaker to support this sub-task. Both the neutral speech dataset and the emotional speech dataset are the same as in sub-task 1.
Dataset
Participants will be provided with two datasets:
- VLSP-2023-EMO: The Emotional Speech Dataset includes about 4 hours of speech from a single speaker: 8351 sentences, each with its text and one of five emotion labels (neutral, sad, happy, angry, and surprise).
- VLSP-2023-NEU: The Neutral Speech Dataset includes 6 hours of neutral speech from another speaker: 4647 sentences with their texts.
Regulations
If you violate any of the terms in this section, your final result will not be accepted. These regulations are necessary to make the challenge fair for all teams.
- Datasets & Pre-trained models: Participants can only use the datasets released for each sub-task for model development, i.e. VLSP-2023-EMO & VLSP-2023-NEU for both sub-task 1 and sub-task 2. Any use of external speech data is prohibited, and pre-trained models may be used only under the conditions listed in the next item.
- Pre-trained Models/Tools: You can use open source tools or public pre-trained models in this challenge only if all of the following conditions hold:
  - They are not for the Vietnamese language and are not related to emotions.
  - You have to declare them to the organizers by Oct 3, 2023, and they have to be accepted by the organizers.
  - Accepted models/tools will be shared with all teams so that every team may legitimately use them in this challenge.
TTS API Submission
The requirements for the TTS API can be downloaded at the end of this page. You have to submit a TTS API (a public API link) so that the organizers can use it to synthesize utterances from the texts in the test set. If you do not have a server with a static IP on which to host a public API, you can publish your system as a public image on Docker Hub and submit its link.
Evaluation
The utterances synthesized by each system for the test sentences will be evaluated through perceptual tests. They will be presented to three groups of listeners: speech experts, volunteers, and undergraduates. The perceptual tests will be used to evaluate the intelligibility, naturalness, emotion, and speaker similarity of the synthetic voices. The weighted metrics are as follows:
- Sub-task 1 – ESS with emotion adaptation: Emotion 40%, Intelligibility SUS 15%, Naturalness MOS 15%, Speaker Similarity 20%, Solution & Technical report 10%
- Sub-task 2 – Emotional style transfer in expressive speech synthesis: Emotion 40%, Intelligibility SUS 15%, Naturalness MOS 15%, Speaker Similarity 20%, Solution & Technical report 10%
Your system will be evaluated only after you submit a technical report. The top 3 teams may be required to provide their source code so that the final results can be verified.
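As an illustration of how the weights listed above combine, the following Python sketch computes a single weighted total for one team. It is not the organizers' scoring procedure: the page does not state how each criterion is scaled, so the sketch assumes every component score has already been normalized to the range 0 to 1. The names `WEIGHTS`, `weighted_score`, and the sample scores are hypothetical and exist only for this example.

```python
# Weights taken from the Evaluation section (identical for both sub-tasks).
WEIGHTS = {
    "emotion": 0.40,
    "intelligibility_sus": 0.15,
    "naturalness_mos": 0.15,
    "speaker_similarity": 0.20,
    "technical_report": 0.10,
}


def weighted_score(component_scores):
    """Combine per-criterion scores into one weighted total.

    Assumption (not stated on this page): each score is already
    normalized to [0, 1], so the total also lies in [0, 1].
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in WEIGHTS:
        value = component_scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up scores for a hypothetical team, for illustration only.
    example = {
        "emotion": 0.8,
        "intelligibility_sus": 0.9,
        "naturalness_mos": 0.7,
        "speaker_similarity": 0.6,
        "technical_report": 1.0,
    }
    print(f"weighted total: {weighted_score(example):.2f}")
```

Because emotion carries 40% of the total, a change in the emotion score moves the result more than the same change in any other single criterion.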
Organizers
- Nguyen Thi Thu Trang
- Nguyen Hoang Ky
from School of Information and Communication Technology, Hanoi University of Science and Technology.
Contact at trangntt@soict.hust.edu.vn.