VLSP 2013 - WordSeg & POSTag Task

Introduction

Word segmentation and POS tagging are fundamental yet difficult tasks in NLP, especially for isolating languages such as Vietnamese, in which compound words belong to the core of the language and parts of speech are not well defined in the linguistic literature. A national project on Vietnamese Language and Speech Processing (VLSP), successfully completed in 2009, has provided researchers with fundamental NLP tools and resources, laying the groundwork for further research, development, and deployment of useful software applications in the field.

This campaign aims to evaluate Vietnamese word segmentation and POS tagging systems automatically, in order to encourage scientists to use and evaluate the resources and tools from the VLSP project and to promote the most efficient methods for these basic tasks in Vietnamese processing.

Task Description

This evaluation includes two subtasks: word segmentation and POS tagging.

Participants in the evaluation can use either:

  • exclusively the training data provided by the campaign
  • or any kind of language resources

and must specify which of those two categories they wish to compete in.

Evaluation Metrics

  • Word segmentation:
    • P(recision): (number of words correctly segmented)/(number of words in the system output)
    • R(ecall): (number of words correctly segmented)/(number of words in the reference corpus)
    • F1 measure = 2*P*R/(P+R)
  • POS tagging:
    • P(recision): (number of words correctly tagged)/(number of words in the system output)
    • R(ecall): (number of words correctly tagged)/(number of words in the reference corpus)
    • F1 measure = 2*P*R/(P+R)
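
The formulas above can be illustrated with a minimal Python sketch. It rests on an assumption the task description does not state: a word counts as "correctly segmented" when its character span in the system output exactly matches a word span in the reference, and as "correctly tagged" when the span and the POS tag both match. Whitespace inside a word is ignored when computing offsets so that different segmentations of the same text remain comparable. The function names `to_spans` and `prf` are hypothetical and are not part of any official scorer.

```python
def to_spans(words, tags=None):
    """Turn a word sequence into a set of (start, end[, tag]) items.

    Offsets count non-whitespace characters only (an assumption), so the
    same text segmented in two ways yields comparable spans.
    """
    spans = set()
    pos = 0
    for i, word in enumerate(words):
        length = len("".join(word.split()))  # drop spaces inside compound words
        item = (pos, pos + length)
        if tags is not None:
            item = item + (tags[i],)  # a tagged word must match span AND tag
        spans.add(item)
        pos += length
    return spans


def prf(reference, system):
    """Return (precision, recall, F1) for two sets of span items."""
    correct = len(reference & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


# Word segmentation: the reference has 2 words, the system output has 3.
ref_words = ["a b", "c"]
sys_words = ["a", "b", "c"]
print(prf(to_spans(ref_words), to_spans(sys_words)))

# POS tagging: same segmentation, one tag differs.
print(prf(to_spans(ref_words, ["N", "V"]), to_spans(ref_words, ["N", "N"])))
```

Under this span-based reading, a segmentation error also counts as a tagging error, because a wrongly segmented word has no matching span in the reference.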

Training and Test Data

Three types of training data are available:

  • Segmented Corpus: This corpus contains about ??? sentences extracted from Vietnamese online news that are segmented into words.
  • POS Tagged Corpus: This corpus contains about 30,000 sentences extracted from Vietnamese online news that are segmented into words, with each word tagged with its part of speech.
  • Raw Corpus: This corpus contains about ??? unannotated sentences extracted from Vietnamese online news.

The test corpus includes two types of data: one contains sentences from Vietnamese news, and the other contains sentences from other categories of Vietnamese text.

Data Format

  • All data will be encoded as UTF-8 plain text.
  • The input and output segmented corpora are formatted one word per line.
  • The input and output POS tagged corpora are formatted one word per line, with each word separated from its POS tag by a tab character.
  • The raw corpus is provided as is.
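
As an illustration of the formats listed above, the following Python sketch reads one-word-per-line text, with or without a tab-separated POS tag. The description does not say how sentence boundaries are marked; treating a blank line as a sentence separator is purely an assumption here, and `parse_corpus` is a hypothetical helper name. Lines are split on the tab only, since a Vietnamese compound word may itself contain spaces.

```python
def parse_corpus(text):
    """Parse one-word-per-line text into sentences of (word, tag) pairs.

    The tag is None for lines without a tab (segmented-only corpora).
    ASSUMPTION: a blank line ends the current sentence.
    """
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the first tab only; the word part may contain spaces.
        word, sep, tag = line.partition("\t")
        current.append((word, tag if sep else None))
    if current:
        sentences.append(current)
    return sentences


sample = "học sinh\tN\nhọc\tV\n\nđi\tV\n"
for sentence in parse_corpus(sample):
    print(sentence)
```

When reading from disk, opening the file with `open(path, encoding="utf-8")` matches the UTF-8 encoding stated above.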

Copyrights of the data – Acknowledgment:

The annotated corpora provided by the campaign are collected from two sources:

  • VLSP project 
  • Vietnam Lexicography Center 

Participants should use these data for research purposes only and acknowledge the authorship of these data in their publications.