VLSP 2022 TTS: Emotional speech synthesis

Shared Task Registration Form

Important Dates

July 27, 2022: Registration opens
Aug 31, 2022: Registration closes
Sept 01, 2022: Dataset building starts
Sept 15, 2022: Dataset building ends
~~Oct 1, 2022~~ Oct 3, 2022: Training dataset available
~~Oct 18, 2022~~ Oct 20, 2022: TTS API submission
Oct 27, 2022: Technical report submission, Evaluation phase starts
Nov 12, 2022: Evaluation phase ends, Individual result announcement
Nov 26, 2022: Result announcement (workshop day)

Task Description

Emotional speech synthesis (ESS) generates humanlike, natural-sounding speech from a given input text with a desired emotional expression. This problem has become an important challenge in speech synthesis. Recent advances have enabled many applications, such as virtual assistants, call centers, dubbing of movies and games, audiobook narration, and online education.
VLSP 2022 TTS will feature two evaluation tasks for ESS with 4 emotional labels: neutral, sad, happy, and angry. Teams can participate in either task or in both.

Sub-task 1 – ESS with a single speaker: In this task, participants are asked to build a TTS system that synthesizes emotional speech for a single speaker. Participants will be provided with an emotional training dataset for this speaker.
Sub-task 2 – ESS with speaker adaptation: In this task, participants are asked to build a TTS system that synthesizes emotional speech for another speaker, whose training dataset includes only neutral utterances. Participants are also provided with the emotional speech dataset from sub-task 1.

Dataset

Participants must take part in building the datasets before receiving them. Their main task is to correct the emotional labels for a part of the datasets. Participants will be provided with two datasets:

VLSP-EMO: Emotional Speech Dataset, which includes about 5 hours of speech from a single speaker, collected from films and interviews. The utterances carry 4 emotional labels: neutral, sad, happy, and angry.
VLSP-NEU: Neutral Speech Dataset, which includes 4 hours of neutral speech from another speaker.

Regulations

If you violate any of the terms in this section, your final result will not be accepted. These regulations are necessary to make the challenge fair for all teams.

Datasets & Pre-trained models: Participants can only use the released datasets corresponding to each sub-task for model development, i.e. VLSP-EMO for sub-task 1, and VLSP-EMO & VLSP-NEU for sub-task 2. Any use of external speech data or pre-trained models is prohibited, except as permitted below.
Pre-trained Models/Tools: You can use open source tools or public pre-trained models in this challenge under the following terms:
- They are not for the Vietnamese language and are not related to emotions.
- You have to declare them to the organizers by Sept 30, 2022, and the organizers have to accept them.
- Accepted models/tools will be shared with all teams for legitimate use in this challenge.

TTS API Submission

The requirements for the TTS API can be downloaded at the end of this page. You have to submit your TTS API as a public API link so that the organizers can use it to synthesize utterances from the text in the test set. If you do not have a server with a static IP address to host a public API, you can publish your system as a public image on Docker Hub and submit its link.

Evaluation

The synthesized utterances for the test sentences from each system will be evaluated through perceptual tests. They will be presented to three groups of listeners: speech experts, volunteers, and undergraduates. The perceptual tests will be used to evaluate the intelligibility and naturalness of the synthetic voices, as well as their emotion and speaker similarity. The weighted metrics are as follows:

Sub-task 1 – ESS with a single speaker: Emotion 50%, Intelligibility SUS 20%, Naturalness MOS 20%, Solution & Technical report 10%
Sub-task 2 – ESS with speaker adaptation: Emotion 40%, Intelligibility SUS 15%, Naturalness MOS 15%, Speaker Similarity 20%, Solution & Technical report 10%
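
As an illustration of how the weights above combine, here is a minimal Python sketch. It is not the organizers' scoring procedure: the names WEIGHTS and final_score are hypothetical, and it assumes that every component score has already been normalized to a common scale (for example 0 to 100), which this page does not specify.

```python
# Weights copied from the two sub-task lines above.
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "subtask1": {
        "emotion": 0.50,
        "intelligibility_sus": 0.20,
        "naturalness_mos": 0.20,
        "report": 0.10,
    },
    "subtask2": {
        "emotion": 0.40,
        "intelligibility_sus": 0.15,
        "naturalness_mos": 0.15,
        "speaker_similarity": 0.20,
        "report": 0.10,
    },
}


def final_score(subtask, scores):
    """Return the weighted sum of component scores for one sub-task.

    Assumption: all values in `scores` are on the same scale,
    so the result is on that scale too.
    """
    weights = WEIGHTS[subtask]
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    # Made-up example scores on a 0-100 scale, for illustration only.
    example = {
        "emotion": 70,
        "intelligibility_sus": 90,
        "naturalness_mos": 80,
        "report": 100,
    }
    print(final_score("subtask1", example))
```

Under this assumption, a team's result in sub-task 2 depends on speaker similarity as much as on intelligibility and naturalness combined, which follows directly from the 20% versus 15% + 15% split with emotion reduced to 40%.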

You will be evaluated only after you submit your technical report. The top 3 teams may be required to provide their source code so that the final results can be verified.

Organizers

Nguyen Thi Thu Trang
Nguyen Thi Ngoc Anh
Le Minh Nguyen

from the School of Information and Communication Technology, Hanoi University of Science and Technology. Contact: trangntt@soict.hust.edu.vn.

File

API-Spec-TTS-VLSP-2022.pdf (129.96 KB)

Association for Vietnamese Language and Speech Processing
