Association for Vietnamese Language and Speech Processing

A chapter of VAIP - Vietnam Association for Information Processing

VLSP 2022 – VTB Challenge: Vietnamese Constituency Parsing

Shared Task Registration Form

Important dates

  • July 27, 2022: Registration open.
  • October 1, 2022: Registration closed. Training data for development released.
  • October 15, 2022: Official training data released.
  • November 1, 2022: Public test set released.
  • November 3, 2022: Online challenge started.
  • November 10, 2022: Private test set released.
  • November 12, 2022: End of challenge.
  • November 15, 2022: Deadline for the top 5 teams to submit technical reports. If a top team does not submit its report, the next-ranked teams can submit theirs and take its place (next-ranked teams are advised to write their reports in advance and submit them by this deadline).
  • November 26, 2022: Final winners announcement, result presentation and award ceremony (workshop day).

Task Description

Syntactic parsing is a fundamental problem in natural language processing. Syntactic information plays an important role in many applications, such as machine translation, information extraction, and question answering. Before 2015, the research community witnessed the influence and success of statistical parsing models based on probabilistic context-free grammars, following generative or discriminative approaches. From 2015 onward, deep learning-based parsing models have brought new successes to this problem, but mainly for widely studied languages such as English and Chinese.

With the main goal of promoting research on Vietnamese parsing and creating high-performance parsers for the community, the constituency parsing problem for Vietnamese was included as a shared task of the VLSP 2022 conference.

The problem is to build a constituency parser for Vietnamese. Linguistically, constituency parsing is parsing based on a phrase structure grammar. In computational linguistics, the input to a constituency parser is a sentence, and the output is a constituency tree. For example, for the sentence "Nam làm bài tập", the output can be the following syntax tree:

               Sentence (S)
                     |
         +-----------+-----------+
         |                       |
 Noun Phrase (NP)        Verb Phrase (VP)
         |                       |
     Noun (N)           +--------+--------+
         |              |                 |
        Nam         Verb (V)      Noun Phrase (NP)
                        |                 |
                       làm            Noun (N)
                                          |
                                       bài_tập

Participants can develop their own models or build on existing open-source parsing systems (usually built for other languages). Participants will be provided with a syntactically annotated Vietnamese corpus (the Vietnamese Treebank) [1] of about 10,000 sentences from the journalistic domain on socio-political topics. Participants can use additional resources, such as raw Vietnamese text corpora to train word embedding models for their parsers, or pre-trained word embeddings. Evaluation uses the Parseval method [2] (with provided tools). There are two types of testing data: one in the same domain as the training data and one outside that domain. The out-of-domain testing data is expected to be legal text or biomedical text.

Data Format and Training Data

Participants will be provided with the Vietnamese Treebank (VTB) [1], containing about 10,000 sentences in bracketed-tree format, as follows:

            (S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))
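
The bracketed-tree format above can be read with a small recursive parser. The following is a minimal sketch, not the organisers' tooling: the function names `tokenize`, `parse_tree` and `leaves` are illustrative, and it assumes one well-formed tree per string, with labels and words that contain no parentheses or spaces.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_tree(text: str):
    """Parse one bracketed tree into nested (label, children) tuples.

    A child is either another (label, children) tuple or a word string.
    """
    tokens = tokenize(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]          # the symbol right after '(' is the label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                     # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree) -> List[str]:
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


tree = parse_tree("(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))")
print(tree[0])       # root label
print(leaves(tree))  # words in sentence order
```

Nested tuples keep the sketch dependency-free; an existing tree library could be used instead.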

Part-of-speech (POS) tagset: The POS tagset follows that of the Vietnamese Universal Dependencies treebank [3].

Constituency tagset:

No.  Constituency tag  Description
1    NP                Noun phrase
2    VP                Verb phrase
3    AP                Adjective phrase
4    RP                Adverb phrase
5    PP                Prepositional phrase
6    QP                Quantitative phrase
7    MDP               Modal phrase
8    UCP               Coordinated phrase in which components are not the same type
9    LST               List mark phrase
10   WHNP              Interrogative noun phrase ('ai' (who), 'cái gì' (what), 'con gì' (which))
11   WHAP              Interrogative adjective phrase ('lạnh thế nào' (cold how), 'đẹp ra sao' (beautiful how))
12   WHRP              Interrogative adverb phrase
13   WHPP              Interrogative prepositional phrase ('với ai' (with whom), 'bằng cách nào' (by method which))
14   S                 Statement sentence
15   SQ                Question sentence
16   SBAR              Subordinate clause (modifying noun, verb, and adjective)

Functional tagset:

No.  Functional tag  Description
1    H               Head of phrase
2    SUB             Subject
3    DOB             Direct object
4    IOB             Indirect object
5    TPC             Topic
6    PRD             Predicate
7    LGS             Logical subject
8    EXT             Frequency or range complement
9    VOC             Vocative
10   TMP             Temporal adjunct
11   LOC             Location adjunct
12   DIR             Direction adjunct
13   MNR             Manner adjunct
14   PRP             Purpose adjunct
15   CND             Condition adjunct
16   CNC             Concession adjunct
17   ADV             Adverbial adjunct
18   EXC             Exclamation sentence
19   CMD             Command sentence

Null-element tagset:

No.  Null-element tag  Description
1    *T*               Null element (trace within sentence)
2    *E*               Null element in ellipsis phenomenon
3    *0*               Null element in complementizer
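
Because submissions need only POS and constituency tags, it can be useful to reduce treebank trees to those labels before training. The sketch below is illustrative only and rests on two assumptions that are not stated on this page: that functional tags are attached to a constituency label with a hyphen (for example `NP-SUB`), and that null elements appear as leaves spelled exactly `*T*`, `*E*` or `*0*`. The names `strip_label` and `simplify`, and the sample trees, are hypothetical.

```python
# Assumption: null elements appear as leaves with exactly these spellings.
NULL_ELEMENTS = {"*T*", "*E*", "*0*"}


def strip_label(label: str) -> str:
    """Keep the part of a label before the first hyphen ('NP-SUB' -> 'NP')."""
    base = label.split("-", 1)[0]
    return base or label  # a label that starts with '-' is left untouched


def simplify(tree):
    """Drop null elements and functional tags from a (label, children) tree.

    Returns None when a subtree contains nothing but null elements.
    """
    label, children = tree
    kept = []
    for child in children:
        if isinstance(child, str):
            if child not in NULL_ELEMENTS:
                kept.append(child)
        else:
            sub = simplify(child)
            if sub is not None:
                kept.append(sub)
    if not kept:
        return None
    return (strip_label(label), kept)


# Hypothetical annotated tree, written as nested (label, children) tuples.
example = ("S", [("NP-SUB", [("N", ["Nam"])]),
                 ("VP", [("V", ["làm"]), ("NP-DOB", [("N", ["bài_tập"])])])])
print(simplify(example))
```

If the corpus files encode functional tags or null elements differently, only `strip_label` and `NULL_ELEMENTS` would need to change.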

Testing Data

Testing data is a list of Vietnamese sentences. The sentences have been segmented into words. For example:

            Tôi đi Nha_Trang dự hội_thảo .
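
Reading such input is straightforward. The sketch below assumes one sentence per line with words separated by whitespace, and, judging from the examples on this page, that the syllables of a multi-syllable word are joined by underscores. The helper names `read_sentences` and `syllables` are illustrative.

```python
def read_sentences(lines):
    """Yield each non-empty line as a list of word tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


def syllables(word):
    """Split a segmented word into syllables ('Nha_Trang' -> ['Nha', 'Trang'])."""
    parts = [part for part in word.split("_") if part]
    return parts or [word]  # a token made only of underscores stays as it is


sample = ["Tôi đi Nha_Trang dự hội_thảo ."]
sentences = list(read_sentences(sample))
print(sentences[0])
print(syllables("Nha_Trang"))
```

With a real file, `read_sentences(open(path, encoding="utf-8"))` would work the same way, where `path` is whatever file the organisers distribute.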

Result submission

Participants must submit results in bracketed-tree format, in the same order as the testing data. In the syntactic trees, only POS tags and constituency tags are required.

            (S (NP Tôi)
               (VP đi
                   (NP (NNP Nha) (NNP Trang)))
               (VP dự
                   (NP hội_thảo)))
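
Before submitting, it may help to check each output line mechanically. The sketch below is a sanity check, not an official validator. It tests whether the brackets balance and whether the words of the tree match the input sentence, under the assumption that the first symbol after each `(` is a label and every other symbol is a word. The function name `check_tree` is illustrative.

```python
def check_tree(bracketed: str, sentence: str):
    """Return a list of problems found in one submitted tree (empty if none)."""
    problems = []
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    depth = 0
    words = []
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                break
        elif i == 0 or tokens[i - 1] != "(":
            words.append(tok)  # not directly after '(', so a word, not a label
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    if words != sentence.split():
        problems.append("words differ from the input sentence")
    return problems


sentence = "Nam làm bài_tập"
print(check_tree("(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))", sentence))
print(check_tree("(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập)))", sentence))
```

Running this check over the whole output file, paired line by line with the test sentences, also confirms that the submission keeps the required order.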

Evaluation Metric

Submissions will be evaluated against the ground-truth trees using the Parseval metric [2].
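
As a rough illustration of the Parseval idea, the sketch below computes labeled bracket precision, recall and F1 for a single sentence. It is not the evaluation tool provided by the organisers, whose conventions may differ. It assumes that every word sits under its own POS node and that such single-word POS nodes are not counted as brackets. The names `brackets` and `parseval` are illustrative, and the predicted tree is made up.

```python
from collections import Counter


def brackets(bracketed: str) -> Counter:
    """Collect (label, start, end) spans of phrase nodes in a bracketed tree.

    Nodes that only tag a single word, such as (N Nam), are skipped.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []    # entries: [label, index of first word, has a child node]
    n_words = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] = True
            stack.append([tokens[i + 1], n_words, False])
            i += 2  # skip '(' and the label
            continue
        if tok == ")":
            label, start, has_child = stack.pop()
            if has_child or n_words - start > 1:
                spans[(label, start, n_words)] += 1
        else:
            n_words += 1  # a word
        i += 1
    return spans


def parseval(gold: str, predicted: str):
    """Return labeled bracket precision, recall and F1 for one sentence."""
    g, p = brackets(gold), brackets(predicted)
    matched = sum((g & p).values())  # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


gold = "(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))"
pred = "(S (NP (N Nam)) (VP (V làm)) (NP (N bài_tập)))"  # object attached to S
print(parseval(gold, pred))
```

For a whole test set, the matched, predicted and gold bracket counts are usually summed over all sentences before the ratios are computed, rather than averaging per-sentence scores.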

Organisers

  • Nguyen Thi Minh Huyen
  • Phuong-Thai Nguyen
  • Xuan-Luong Vu

Contact Us

  • vlsp.resources at gmail.com

References

[1] Phuong-Thai Nguyen, Xuan-Luong Vu, Thi-Minh-Huyen Nguyen, Van-Hiep Nguyen and Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The 3rd Linguistic Annotation Workshop (LAW), Singapore. Pages 182-185, 2009.

[2] Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft), Chapter 13. 2021 (https://web.stanford.edu/~jurafsky/slp3/13.pdf).

[3] POS Tagset and annotation guidelines

Sponsors and Partners

VinBIGDATA  VinIF  AIMESOFT  bee  Dagoras

zalo  VTCC  VCCorp

IOIT  HUS  USTH  UET  TLU  UIT  INT2  jaist  VIETLEX