VLSP 2023 Challenge on Emotional Speech Synthesis

Shared Task Registration Form

Important dates
(Timezone: UTC+7)

  • Aug 15, 2023: Registration opens
  • Sept 15, 2023: Registration closes
  • Oct 3, 2023: Training datasets available
  • Oct 30, 2023: TTS API submission
  • Nov 26, 2023: Technical report submission
  • Dec 10, 2023: Individual result announcement
  • Dec 15-16, 2023: Result announcement & workshop days

Task Description

Emotional speech synthesis (ESS) is the task of producing realistic, natural-sounding speech that conveys a specific emotion for a given text input. It remains a significant challenge in the field of speech synthesis. Recent advances have enabled various applications, including virtual assistants, call centers, dubbing for movies and games, audiobook narration, and online education.

In the VLSP 2023 TTS challenge, there are two evaluation tasks for ESS, each focusing on generating emotional speech from different input information. Participating teams may compete in either one or both of the tasks.

  • Sub-task 1 – ESS with emotion adaptation: Participants are asked to build a TTS system that synthesizes emotional speech for a single target speaker whose training dataset contains only neutral utterances. The system must be able to synthesize four emotion labels: neutral, sad, happy, and angry. To support this sub-task, participants are also provided with the emotional speech dataset of another speaker (see the interface sketch after this list).
  • Sub-task 2 – Emotional style transfer in expressive speech synthesis: Participants are asked to build a TTS system that synthesizes expressive speech for a single target speaker whose training dataset contains only neutral utterances. The system takes as input a text and a reference audio sample from another speaker, and must synthesize audio whose style and emotion are as close as possible to the reference. Participants are also provided with the emotional speech dataset of another speaker to support this sub-task; both the neutral and emotional speech datasets are the same as in sub-task 1.
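
For concreteness, a minimal sketch of the two inference interfaces is given below in Python. The function names, signatures, and the Emotion type are illustrative assumptions made for this description, not a required API; the official interface is specified in the TTS API requirements (see the TTS API Submission section).

    # Illustrative sketch of the two sub-task interfaces. The names and
    # signatures are assumptions for this description, not a required API.
    from typing import Literal

    Emotion = Literal["neutral", "sad", "happy", "angry"]

    def synthesize_subtask1(text: str, emotion: Emotion) -> bytes:
        """Sub-task 1: synthesize `text` in the target speaker's voice,
        conveying the requested emotion label. Returns WAV bytes."""
        raise NotImplementedError  # participants plug in their TTS model

    def synthesize_subtask2(text: str, reference_wav: bytes) -> bytes:
        """Sub-task 2: synthesize `text` in the target speaker's voice,
        transferring the style and emotion of the reference audio, which
        is spoken by another speaker. Returns WAV bytes."""
        raise NotImplementedError  # participants plug in their TTS model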

Dataset

Participants will be provided with two datasets:

  • VLSP-2023-EMO: an emotional speech dataset of about 4 hours from a single speaker, consisting of 8,351 sentences, each with its text and one of five emotion labels: neutral, sad, happy, angry, and surprise.

  • VLSP-2023-NEU: a neutral speech dataset of 6 hours from another speaker, consisting of 4,647 sentences with their texts (a loading sketch follows this list).
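
As a rough illustration, the snippet below shows how a team might iterate over one of these corpora. The directory layout and the tab-separated metadata format are assumptions made here, not the documented distribution format; adapt the loader accordingly.

    # Hypothetical loader for a released corpus. The layout (a wavs/
    # directory plus a metadata.tsv with file_id, emotion, transcript
    # columns) is an assumption; adapt it to the actual distribution.
    import csv
    from collections import Counter
    from pathlib import Path

    def load_corpus(root):
        """Yield (wav_path, emotion, text) triples for one corpus."""
        root = Path(root)
        with open(root / "metadata.tsv", encoding="utf-8") as f:
            for file_id, emotion, text in csv.reader(f, delimiter="\t"):
                yield root / "wavs" / (file_id + ".wav"), emotion, text

    # Example: count utterances per emotion label in VLSP-2023-EMO.
    counts = Counter(emotion for _, emotion, _ in load_corpus("VLSP-2023-EMO"))
    print(counts)  # expected labels: neutral, sad, happy, angry, surprise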

Regulations

If you fail to follow any of the terms in this section, your final result will not be accepted. These regulations are necessary to keep the challenge fair for all teams.

  • Datasets & pre-trained models: Participants may only use the released datasets for model development, i.e., VLSP-2023-EMO and VLSP-2023-NEU for both sub-task 1 and sub-task 2. Any use of external speech data or pre-trained models is prohibited, except as described in the next item.
  • Pre-trained models/tools: You may use open-source tools or public pre-trained models in this challenge, provided that:
    • they are not built for the Vietnamese language and are not related to emotions;
    • you declare them to the organizers by Oct 3, 2023, and they are accepted by the organizers.
    Accepted models/tools will be shared with all teams for legitimate use in this challenge.

TTS API Submission

The requirements for the TTS API can be downloaded at the end of this page. You must submit a TTS API (a public API link) so that the organizers can use it to synthesize utterances from the texts in the test set. If you do not have a server with a static IP to host a public API, you can instead deploy your system as a public image on Docker Hub and submit its link.
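
As a rough sketch, a minimal public endpoint might look like the following. The route, parameter names, and WAV response format are assumptions made here; the actual contract is defined in the downloadable requirements document.

    # Minimal sketch of a public TTS API using Flask. The endpoint path,
    # parameters, and WAV response are illustrative assumptions; follow
    # the official requirements document for the actual contract.
    import io
    from flask import Flask, request, send_file

    app = Flask(__name__)

    @app.post("/synthesize")
    def synthesize():
        text = request.form["text"]                  # sentence to synthesize
        emotion = request.form.get("emotion")        # sub-task 1: emotion label
        reference = request.files.get("reference")   # sub-task 2: reference audio
        wav_bytes = run_tts(text, emotion, reference)
        return send_file(io.BytesIO(wav_bytes), mimetype="audio/wav",
                         download_name="output.wav")

    def run_tts(text, emotion, reference):
        """Placeholder for the participant's trained TTS system."""
        raise NotImplementedError

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)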

Evaluation

The synthetic utterances for the test sentences from each system will be evaluated through perceptual tests. The synthesized utterances will be presented to three groups of listeners: speech experts, volunteers, and undergraduates. Perceptual tests will be used to evaluate the intelligibility/naturalness and the emotion/speaker similarity of the synthetic voices. The weighted metrics are as follows (a worked scoring example is given after the list):

  • Sub-task 1 – ESS with emotion adaptation: Emotion 40%, Intelligibility SUS 15%, Naturalness MOS 15%, Speaker Similarity 20%, Solution & Technical report 10%
  • Sub-task 2 – Emotional style transfer in expressive speech synthesis: Emotion 40%, Intelligibility SUS 15%, Naturalness MOS 15%, Speaker Similarity 20%, Solution & Technical report 10%
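
Assuming the final score for a sub-task is the weighted sum of the five component scores, the arithmetic looks like the sketch below. The component values are made-up placeholders, normalized to [0, 1], purely for illustration.

    # Worked example of the weighted final score for either sub-task.
    # The component scores are made-up placeholders in [0, 1].
    weights = {"emotion": 0.40, "intelligibility_sus": 0.15,
               "naturalness_mos": 0.15, "speaker_similarity": 0.20,
               "report": 0.10}
    scores = {"emotion": 0.82, "intelligibility_sus": 0.90,
              "naturalness_mos": 0.75, "speaker_similarity": 0.70,
              "report": 0.85}
    final = sum(weights[k] * scores[k] for k in weights)
    print(f"final score: {final:.3f}")  # weighted sum, here about 0.80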

You will only be evaluated after you submit your technical report. The top 3 teams may be required to provide source code so that the final results can be verified.

Organizers

  • Nguyen Thi Thu Trang
  • Nguyen Hoang Ky

from the School of Information and Communication Technology, Hanoi University of Science and Technology.

Contact: trangntt@soict.hust.edu.vn.