VLSP 2023 Challenge on Vietnamese Spoofing-Aware Speaker Verification Shared Task Registration

Important dates
(Timezone: UTC+7)

Aug 14, 2023: Registration open
Sept 14, 2023: Registration close
Oct 3, 2023: Training dataset release
Oct 20, 2023: Public test set release (maximum of 20 submissions per day)
Nov 17, 2023: Private test set release (maximum of 6 submissions)
Nov 17, 2023: Test result submission
Nov 26, 2023: Technical report submission
Dec 15-16, 2023: Result announcement - Workshop days.

Task Description

Speaker verification (SV) is the task of verifying whether an input utterance matches the claimed identity. The success of speaker verification systems depends heavily on large training datasets collected under real-world conditions. While widely spoken languages such as English and Chinese have abundant datasets, resources for low-resource languages such as Vietnamese remain limited. To advance the development of Vietnamese speaker verification, the Vietnamese Spoofing-Aware Speaker Verification Challenge (VSASV) 2023 has been designed for understanding and comparing SV research techniques on Vietnam-Celeb, a large-scale dataset for Vietnamese speaker recognition.

This is the first spoofing-aware speaker verification challenge for Vietnamese. The evaluation metric is the same as in conventional speaker verification, the Equal Error Rate (EER), but we introduce spoofed negative samples created by synthesizing speech from target speakers or by recording speech from different devices. In this way, we encourage participants to develop SV systems that are jointly optimized for spoofing detection and speaker verification.

Basic Regulations

Any use of external speaker data or pre-trained models is PROHIBITED, including models pre-trained on other tasks, e.g. speech recognition, text-to-speech, speech enhancement, voice activity detection...
Participants can use non-speech data (noise samples, impulse responses…) for augmentation, but must specify this data and share it with the other teams.
Participants can create spoofed samples only with the provided data.
Participants can apply data augmentation techniques only to the data provided by the organizers.
The challenge has a public and a private test set. The final standings for the task will be decided based on private test results. Teams may be required to provide source code to examine the final results.

Training Data

Participants will be using the Vietnam-Celeb dataset for model development. The Vietnam-Celeb dataset consists of 1,000 speakers and more than 87,000 utterances. The total duration of the dataset is 187 hours, with all utterances resampled to 16,000 Hz. The data covers a wide range of challenging scenarios, including interviews, podcasts, game shows, talk shows, and other types of entertainment videos. The audio samples also represent real-world conditions, with various types of noises, such as background chatting, music, and cheers.

Additionally, we will provide the participants with spoofed data generated from audio recordings collected from various sources. This will be beneficial for the teams in developing spoofing-aware SV systems.

Test Data

Train speakers and test speakers are mutually exclusive.

There will be two test sets given:

Public test: the public test contains both bona fide (real speech) and spoofed negative samples. The negative pairs are chosen randomly.
Private test: the private test set contains both bona fide and spoofed negative samples. The bona fide negative pairs are chosen such that the speakers have the same gender and dialect.

In the test sets, each record is a single line containing two fields separated by a tab character, in the following format:

enrollment_wav<TAB>test_wav<NEWLINE>

where

enrollment_wav - The enrollment utterance
test_wav - The test utterance

For example:

enrollment_wav test_wav
file1.wav file2.wav
file1.wav file3.wav
file1.wav file4.wav
...
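The pair list format described above can be read with a few lines of Python. This is a minimal sketch, not an official tool: the function name `read_pairs` is hypothetical, and it assumes the first line of the file is the `enrollment_wav`/`test_wav` header row, as the example suggests.

```python
import io


def read_pairs(handle):
    """Return a list of (enrollment_wav, test_wav) tuples from a pair list.

    Assumption: the first line is a header row and is skipped.
    """
    pairs = []
    lines = iter(handle)
    next(lines, None)  # skip the assumed header line
    for line_number, line in enumerate(lines, start=2):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated fields")
        pairs.append((fields[0], fields[1]))
    return pairs


# Usage with an in-memory sample instead of a real file.
sample = (
    "enrollment_wav\ttest_wav\n"
    "file1.wav\tfile2.wav\n"
    "file1.wav\tfile3.wav\n"
    "file1.wav\tfile4.wav\n"
)
pairs = read_pairs(io.StringIO(sample))
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8")`.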

Evaluation metric

The performance of the models will be evaluated using the Equal Error Rate (EER), the error rate at the decision threshold where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR).
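For checking a system on a locally held-out set with known labels, the EER can be approximated as sketched below. This is an illustrative implementation and not necessarily the organizers' scoring script: the function name `equal_error_rate` is hypothetical, labels are assumed to be 1 for target trials and 0 for non-target trials (bona fide or spoofed), and a trial is accepted when its score is greater than or equal to the threshold. With a finite number of trials FAR and FRR may never be exactly equal, so the sketch returns their average at the threshold where they are closest.

```python
from itertools import groupby


def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and 0/1 labels."""
    n_target = sum(1 for y in labels if y == 1)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    ordered = sorted(zip(scores, labels))
    rejected_targets = 0
    rejected_nontargets = 0
    # Threshold below every score: everything is accepted (FAR = 1, FRR = 0).
    best_gap, eer = 1.0, 0.5
    # Raise the threshold past each distinct score value in turn.
    for _, group in groupby(ordered, key=lambda pair: pair[0]):
        for _, label in group:
            if label == 1:
                rejected_targets += 1
            else:
                rejected_nontargets += 1
        far = (n_nontarget - rejected_nontargets) / n_nontarget
        frr = rejected_targets / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example: one target trial scores below one non-target trial.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

In the toy example, a threshold just above 0.4 rejects one of four target trials and accepts one of four non-target trials, so FAR and FRR meet at 0.25.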

Submission Guidelines

Multiple submissions are allowed, subject to the submission limit of each phase. The evaluation result is based on the submission with the lowest EER.

The submission file consists of a header line followed by one line per test pair, giving the pair and the cosine similarity output by the system for that pair. The pairs in the submission file must appear in the same order as in the pair list. Each line must contain 3 fields separated by tab characters, in the following format:

enrollment_wav<TAB>test_wav<TAB>score<NEWLINE>

where

enrollment_wav - The enrollment utterance
test_wav - The test utterance
score - The cosine similarity

For example:

enrollment_wav test_wav score
file1.wav file2.wav 0.81285
file1.wav file3.wav 0.01029
...
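A submission file in this format could be produced along the following lines. This is a sketch under stated assumptions: `cosine_similarity` and `write_submission` are hypothetical helper names, the `embeddings` dictionary stands in for whatever speaker embeddings a system extracts per file, and the five-decimal formatting simply mirrors the example above rather than a stated requirement.

```python
import io
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def write_submission(handle, pairs, embeddings):
    """Write the header, then one scored line per pair, keeping pair order."""
    handle.write("enrollment_wav\ttest_wav\tscore\n")
    for enrollment_wav, test_wav in pairs:
        score = cosine_similarity(embeddings[enrollment_wav], embeddings[test_wav])
        handle.write(f"{enrollment_wav}\t{test_wav}\t{score:.5f}\n")


# Usage with made-up two-dimensional embeddings (real ones are much larger).
demo_embeddings = {
    "file1.wav": [1.0, 0.0],
    "file2.wav": [1.0, 0.0],
    "file3.wav": [0.0, 1.0],
}
demo_pairs = [("file1.wav", "file2.wav"), ("file1.wav", "file3.wav")]
buffer = io.StringIO()
write_submission(buffer, demo_pairs, demo_embeddings)
submission_lines = buffer.getvalue().splitlines()
```

Iterating over the pair list directly, rather than over the embeddings, keeps the output in the same order as the pair list, as the guidelines require.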

Organizers

Pham Viet Thanh
Nguyen Thi Thu Trang
Nguyen Xuan Thai Hoa
Hoang Long Vu

Hosted by: Hanoi University of Science and Technology

Contact at trangntt@soict.hust.edu.vn

Association for Vietnamese Language and Speech Processing
