VLSP 2023 Invited Talk: An overview of foundation models for Vietnamese language processing

Dat Quoc Nguyen

Dat Quoc Nguyen (Ph.D.) is a Senior Research Scientist and the Head of the Natural Language Processing department at VinAI Research, Vietnam. He was an Honorary Fellow in the School of Computing and Information Systems at the University of Melbourne, Australia, where he was previously a Research Fellow. Before that, he received his Ph.D. in Computer Science from Macquarie University, Australia. Dat Quoc Nguyen has authored 70+ scientific papers covering core NLP problems, ML methods for NLP, and their applications to low-resource languages and specific domains, achieving an h-index of 32 with over 5000 citations (according to Google Scholar). He has released many ML/NLP toolkits and datasets that are widely used in both academia and industry. He has also created large language models and other foundation models, including PhoGPT, PhoBERT, BARTpho, XPhoneBERT, and BERTweet, with millions of downloads.

An overview of foundation models for Vietnamese language processing

Abstract. In this talk, I will provide a brief overview of foundation models for Vietnamese language processing, covering encoder-only, decoder-only, and encoder-decoder architectures. I will then delve into the details of a 7.5B-parameter generative model series for Vietnamese named PhoGPT, which comprises the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant, PhoGPT-7B5-Instruct.