O-COCOSDA and VLSP 2022 - MSV Shared task: Multilingual Speaker Verification

Shared Task Registration Form

Important dates

  • July 25th, 2022: Challenge announcement on the COCOSDA Conference webpage as well as other publicity channels. Registration open.

  • September 7th, 2022: Common training dataset release

  • October 7th, 2022: Public test set release (maximum of 20 submissions per day)

  • October 15th, 2022: Private test set release (maximum of 6 submissions)

  • October 25th, 2022: Announcing the Top 3 to be presented at the conference

  • October 30th, 2022: Technical report submission

  • November 26th, 2022: Announcing the ranking winners.

Task Description

Speaker verification (SV) is the task of verifying whether an input utterance matches a claimed identity. Despite the growth of speaker verification research on VoxCeleb and CN-Celeb, which contain only English or Chinese speech samples, there is still very little research on methods for other languages, especially in low-resource scenarios. With the aim of advancing speaker verification for Asian languages, the COCOSDA Multi-lingual Speaker Verification (MSV) Challenge 2022 has been designed for understanding and comparing SV techniques on a common dataset, AMSV, which includes mainly Asian languages.

In the training dataset, speaker identities are mostly assigned based on the recording devices used in a community project. Although pre-processed through a cleaning pipeline, the training dataset may still contain some label noise, which raises another issue to be solved in this challenge: weak supervision with weakly labeled, large-scale extra data.

Training Dataset

The common training dataset will contain utterances from 9 languages, 7 of which are Asian: English, French, Uzbek, Hindi, Tamil, Chinese, Japanese, Vietnamese, and Thai.

The dataset will have two sub-distributions:

Sub-distribution                With speaker labels   Source/Domain                                            Style
AMSV-CV (Common Voice data)     Yes                   Blog posts, books, newspapers                            Read
AMSV-YT (YouTube data)          No                    Movies, talk shows, podcasts, game shows, TV reportage   Spontaneous

Sub-distribution of AMSV-CV (out-of-domain):

The utterances in Common Voice were recorded by contributors reading sentences displayed on the screen. As a result, most of the speech is in a reading style. The pre-defined sentences are taken from books, blog posts, and newspapers.

Common Voice was not originally collected and validated with the speaker verification task in mind, but for automatic speech recognition, so there are some mistakes in the data labels. We therefore provide the data together with a basic pre-processing pipeline to clean the label noise. You will also be provided with the device information for the utterances in the Common Voice project to do further pre-processing.
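As a rough illustration, one possible further cleaning step is to keep only speakers whose metadata looks self-consistent. The column names (`client_id`, `device`) and the single-device heuristic below are assumptions for illustration, not the released pipeline:

```python
import csv
import io
from collections import defaultdict

def filter_speakers(tsv_text, min_utts=2):
    """Keep only utterances whose speaker (client_id) has at least
    min_utts recordings, all from a single device -- a rough proxy
    for a consistent identity label.  Column names are assumptions."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["client_id"]].append(row)
    kept = []
    for spk, utts in by_speaker.items():
        devices = {u["device"] for u in utts}
        if len(utts) >= min_utts and len(devices) == 1:
            kept.extend(utts)
    return kept

# Tiny inline example: speaker "a" is consistent, "b" mixes devices.
data = (
    "path\tclient_id\tdevice\n"
    "u1.wav\ta\tphone\n"
    "u2.wav\ta\tphone\n"
    "u3.wav\tb\tphone\n"
    "u4.wav\tb\tlaptop\n"
)
kept = filter_speakers(data)
```

Here only speaker "a" survives the filter; whether a single-device constraint is appropriate depends on how the device metadata was actually recorded.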

Sub-distribution of AMSV-YT (in-domain):

AMSV-YT was built to be a far more complex, real-world representation compared to the AMSV-CV sub-distribution. Utterances are collected from multiple genres of spontaneous speech, including movies, talk shows, game shows, and TV reportage. Furthermore, the data contain a large amount of real-world noise, as well as overlapping speakers and speaking-style variations.


Sub-tasks

The COCOSDA MSV Challenge 2022 will feature three evaluation sub-tasks. Teams may participate in one, two, or all of them:

  • Task-01 (Seen languages): Participants are asked to verify whether the two utterances in each pair come from the same speaker, for 06 languages seen in the training set: Vietnamese, French, Chinese, Hindi, Thai, and Japanese.

  • Task-02 (Unseen languages): Participants are asked to verify whether the two utterances in each pair come from the same speaker, for 03 unseen languages: Mongolian, Arabic, and Indonesian.

  • Task-03 (Cross-lingual): The enrollment and test utterances come from the same speaker but in different languages: Uyghur-Chinese.

Basic Regulations:

  • Any use of external speaker data or pre-trained models is prohibited.

  • Participants may use non-speech data (noise samples, impulse responses, etc.) for augmentation, but must specify it and share it with the other teams.

  • Each task has a public and a private test set. The final standings for all tasks will be decided based on the private test results. Teams may be required to provide source code so that the final results can be verified.

Test data

A private test set will be made available for each of the three tasks, AMSV-T01, AMSV-T02, and AMSV-T03, as described in the Sub-tasks part:

  • AMSV-T01: Includes utterance pairs from the 06 languages seen in the training set: Vietnamese, French, Chinese, Hindi, Thai, and Japanese.

  • AMSV-T02: Includes utterance pairs from the 03 unseen languages: Mongolian, Arabic, and Indonesian.

  • AMSV-T03: Includes bilingual Uyghur-Chinese data for enrollment and test.

For the first two sub-tasks, the test sets will contain only in-domain utterances. Train speakers and test speakers are mutually exclusive.

In the test sets, each record is a single line containing two fields separated by a tab character, in the following format:



enrollment_wav - The enrollment utterance

test_wav - The test utterance

For example:

enrollment_wav test_wav

file1.wav file2.wav

file1.wav file3.wav

file1.wav file4.wav
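A trial list in this format can be read with a few lines of standard Python; this is a minimal sketch, with the pair data inlined rather than loaded from the released file:

```python
import csv
import io

def read_trials(tsv_text):
    """Parse the tab-separated trial list: a header line followed by
    one (enrollment_wav, test_wav) pair per line."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip header: ["enrollment_wav", "test_wav"]
    return [(enroll, test) for enroll, test in reader]

trials = read_trials(
    "enrollment_wav\ttest_wav\n"
    "file1.wav\tfile2.wav\n"
    "file1.wav\tfile3.wav\n"
    "file1.wav\tfile4.wav\n"
)
```

Keeping the pairs as an ordered list matters later, because the submission file must preserve the order of the pair list.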


Evaluation metric

The performance of the models will be evaluated by the Equal Error Rate (EER), the operating point at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR):

EER = FAR(θ) = FRR(θ), where θ is the decision threshold at which the two rates are equal.
For the AMSV-T01 and AMSV-T02 test sets, which contain multiple languages, the final result will be the average of the EERs calculated independently for each language.
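As a sketch, the EER can be approximated from a list of trial scores and ground-truth labels by sweeping a threshold over the observed scores; this is a common approximation, not the official scoring script:

```python
def equal_error_rate(scores, labels):
    """Approximate EER: sweep a threshold over the observed scores and
    return the operating point where FAR (non-targets accepted) is
    closest to FRR (targets rejected).
    labels: 1 = same speaker (target), 0 = different speaker."""
    targets = [s for s, l in zip(scores, labels) if l == 1]
    non_targets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(s >= t for s in non_targets) / len(non_targets)
        frr = sum(s < t for s in targets) / len(targets)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0.
err = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # -> 0.0
```

For the multilingual test sets, this function would be applied per language and the resulting EERs averaged, as described above.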

Submission Guidelines

Multiple submissions are allowed, up to the limit for each phase; the evaluation result is based on the submission with the lowest EER.

The submission file comprises a header and, for each testing pair, the cosine similarity output by the system. The pairs in the submission file must appear in the same order as in the pair list. Each line must contain 3 fields separated by a tab character, in the following format:



enrollment_wav - The enrollment utterance

test_wav - The test utterance

score - The cosine similarity

For example:

enrollment_wav test_wav score

file1.wav file2.wav 0.81285

file1.wav file3.wav 0.01029

file1.wav file4.wav 0.45792
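A file in this format can be produced by scoring each pair with the cosine similarity of the two speaker embeddings and writing the tab-separated lines. The embedding vectors below are toy placeholders, not real system outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def submission_lines(pairs, scores):
    """Format (enrollment_wav, test_wav) pairs and their scores as the
    tab-separated submission file: header first, pair order preserved."""
    lines = ["enrollment_wav\ttest_wav\tscore"]
    for (enroll, test), s in zip(pairs, scores):
        lines.append(f"{enroll}\t{test}\t{s:.5f}")
    return "\n".join(lines)

# Toy example: identical embedding directions score 1.0.
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
out = submission_lines([("file1.wav", "file2.wav")], [sim])
```

Note that cosine similarity depends only on the direction of the embeddings, so length-normalizing them beforehand does not change the scores.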



  • Nguyen Thi Thu Trang
  • Pham Viet Thanh
  • Tran Dang Tuyen
  • Huu Tuong Tu
  • Vi Thanh Dat

Hosted by: Hanoi University of Science and Technology (trangntt@soict.hust.edu.vn)