Vietnamese Relation Extraction

Important dates

  • Sep 10, 2020: Registration open

  • Oct 15, 2020: Registration closed

  • Oct 20, 2020: Training data released

  • Nov 01, 2020: Development data released

  • Dec 05, 2020: Test set released

  • Dec 06, 2020: Test result submission

  • Dec 15, 2020: Technical report submission

  • Dec 18, 2020: Result announcement (workshop day)

1. Introduction

The rapid growth in the volume and variety of news brings an unprecedented opportunity to explore electronic text, but also an enormous challenge: a massive amount of unstructured and semi-structured data. Recent progress in text mining needs to be supported by Information Extraction (IE) and Natural Language Processing (NLP) techniques. One of the most fundamental sub-tasks of IE is Relation Extraction (RE): the task of identifying and determining the semantic relations between pairs of named entity mentions (or nominals) in text [1]. Given a document (or a set of documents) as input, a relation extraction system aims to extract all pre-defined relationships mentioned in it by identifying the corresponding entities and determining the type of relationship between each pair of entities.

Given this motivation, several challenge evaluations have been organized to assess and advance relation extraction research, such as Semantic Evaluation (SemEval) [2, 3] and Automatic Content Extraction (ACE) [4, 5]. These evaluations have attracted many scientists worldwide, who attend and publish their latest research on semantic relation extraction. Many approaches have been proposed for RE in English text, ranging from knowledge-based methods to machine learning-based methods [6, 7]. However, studies on this problem for Vietnamese text are still at an early stage, with only a few initial achievements.

2. Task description

The Relation Extraction task is proposed to support the intelligent processing of Vietnamese text by addressing one of the fundamental information extraction sub-tasks: relation extraction. The task centers on classifying entity pairs in Vietnamese news text into four non-overlapping, pre-defined categories of semantic relations.

This task focuses only on intra-sentence relation extraction, i.e., relations are limited to those expressed within a single sentence. A relation between two entity mentions is annotated if and only if it is explicitly stated in the sentence containing both mentions. Even if a relationship holds between two entities in the real world (or elsewhere in the document), there must be evidence for it in the local context where it is tagged.

3. Data

The dataset has been reused and developed from the VLSP-2018 Named Entity Recognition for Vietnamese (VNER 2018) task [8], collected from electronic newspapers published on the web. It is annotated with three types of Named Entities (NE): Locations (LOC), Organizations (ORG), and Persons (PER). Based on these three NE types, we selected four relation types whose coverage is sufficiently broad to be of general and practical interest. Our selection references and modifies the relation types and subtypes used in the ACE 2005 task [4, 5], and we aimed to avoid semantic overlap as much as possible. Some relation types are directed, i.e., their arguments are ordered (asymmetric); the others are undirected (symmetric), and their arguments are unordered. The four relation types are described in Table 1 and below (detailed information is given in the annotation guideline).

No.  Relation                  Arguments                            Directionality
1    LOCATED                   PER–LOC, ORG–LOC                     Directed
2    PART–WHOLE                LOC–LOC, ORG–ORG, ORG–LOC            Directed
3    PERSONAL–SOCIAL           PER–PER                              Undirected
4    ORGANIZATION–AFFILIATION  PER–ORG, PER–LOC, ORG–ORG, LOC–ORG   Directed

Table 1. Relation types, their permitted arguments, and directionality.

  • LOCATED

The located relation captures:

- The physical location of a person,

- The relationship between an organization and the location where it is located, is based, or does business.

It is a directed relation.

  • PART–WHOLE

- The geographical relation captures the location of a location or organization in, at, or as a part of another location or organization,

- The subsidiary relation captures administrative and other hierarchical relationships between organizations.

  • PERSONAL–SOCIAL

Personal-Social relations describe relationships between people. Both arguments must be person entities, and this relation type is symmetric. Examples of this relation type include:

- The connection between two entities in any professional/political/business relationship,

- The family/relative relationship,

- Other personal relationships.

  • ORGANIZATION–AFFILIATION

This relation type captures the following relationships:

- The relationship between a person and their employers (organization),

- The ownership relationship between a person and an organization owned by that person,

- The relationship between the founder/investor (person or organization) and an organization,

- The relationship between a person and an educational institution that this person attends or attended,

- The relationship between a person and an organization of which they are a member (an elected government body, a team, a party, etc.),

- The relationship between a geopolitical location and an organization of which it is a member,

- A person being a citizen, resident, etc. of a location.
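The schema in Table 1 (permitted argument pairs and directionality) can be sketched as a small Python structure. The names here (PERMITTED, UNDIRECTED, is_valid_pair) are illustrative only and are not part of the task's data or tooling:

```python
# Permitted (head, tail) entity-type pairs per relation label, per Table 1.
PERMITTED = {
    "LOCATED": {("PER", "LOC"), ("ORG", "LOC")},
    "PART-WHOLE": {("LOC", "LOC"), ("ORG", "ORG"), ("ORG", "LOC")},
    "PERSONAL-SOCIAL": {("PER", "PER")},
    "ORGANIZATION-AFFILIATION": {("PER", "ORG"), ("PER", "LOC"),
                                 ("ORG", "ORG"), ("LOC", "ORG")},
}

# Only PERSONAL-SOCIAL is undirected, so its arguments are unordered.
UNDIRECTED = {"PERSONAL-SOCIAL"}

def is_valid_pair(relation, head_type, tail_type):
    """Check whether (head_type, tail_type) is a permitted argument pair."""
    pairs = PERMITTED[relation]
    if relation in UNDIRECTED:
        # Symmetric relations accept the pair in either order.
        return (head_type, tail_type) in pairs or (tail_type, head_type) in pairs
    return (head_type, tail_type) in pairs
```

A check like this can be used to filter out candidate entity pairs that cannot carry a given relation before classification.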

4. Data Format

Training and development data consist of raw texts enriched with NE tags and RE information, organized into folders corresponding to different domains. The data is annotated in the BRAT [9] or BioC [10] format.
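As a rough illustration, the standard BRAT standoff (.ann) convention stores entities on "T" lines and binary relations on "R" lines; a minimal parsing sketch, assuming the released data follows those conventions (the sample strings below are hypothetical, not taken from the corpus):

```python
def parse_ann(text):
    """Parse BRAT standoff annotation text into (entities, relations).

    Entity line:   T1<TAB>TYPE START END<TAB>SURFACE
    Relation line: R1<TAB>TYPE Arg1:T1 Arg2:T2
    """
    entities, relations = {}, []
    for line in text.splitlines():
        if line.startswith("T"):
            eid, type_span, surface = line.split("\t")
            etype, start, end = type_span.split()[:3]
            entities[eid] = (etype, int(start), int(end), surface)
        elif line.startswith("R"):
            rid, rel = line.split("\t")
            rtype, arg1, arg2 = rel.split()
            relations.append((rtype,
                              arg1.split(":")[1],
                              arg2.split(":")[1]))
    return entities, relations
```

For example, a file containing a LOC–LOC PART-WHOLE annotation would yield one relation triple linking the two entity IDs.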

5. Evaluation method

Evaluation data: The released test set will be formatted similarly to the training and development data, but without the relation labels. The task is to predict, given a sentence and two tagged entities, which relation label applies. Participating teams must submit their results in the same format as the training and development data.

Result submission: Each team can submit one or several result outputs.

The official evaluation measures are the macro-averaged F1-score and the micro-averaged F1-score over the four relation labels. For each relation label rel:

- Each sentence labeled with rel in the gold standard counts as either a true positive (TP) or a false negative (FN), depending on whether the system labeled it correctly,

- Each sentence labeled with a different relation in the gold standard counts as a false positive (FP) if the system labeled it with rel, and as a true negative (TN) otherwise.

The F1-score of a relation label is the harmonic mean of Precision (P) and Recall (R), calculated as follows:

                   F1 = 2 * P * R / (P + R)

Precision and Recall are determined as follows:

                   P = TP / (TP + FP)

                   R = TP / (TP + FN)
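The two official measures can be sketched directly from these definitions: macro-averaging takes the mean of the per-label F1-scores, while micro-averaging pools the TP/FP/FN counts across labels before computing F1. A minimal sketch (function names and the example counts are hypothetical):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from counts; defined as 0 when the denominator is 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro_f1(counts):
    """counts maps each relation label to its (tp, fp, fn) triple."""
    # Macro: unweighted mean of per-label F1-scores.
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    # Micro: pool the counts over all labels, then compute F1 once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    return macro, micro
```

Note that macro-averaging weights every label equally, so rare relation labels influence it as much as frequent ones, whereas micro-averaging is dominated by the frequent labels.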

6. References

[1] Aggarwal, C. C. (2015). Mining text data. In Data Mining (pp. 429-455). Springer, Cham.

[2] Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Séaghdha, D. Ó., Padó, S., ... & Szpakowicz, S. (2010, July). SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation (pp. 33-38).

[3] Gábor, K., Buscaldi, D., Schumann, A. K., QasemiZadeh, B., Zargayouna, H., & Charnois, T. (2018, June). Semeval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of The 12th International Workshop on Semantic Evaluation (pp. 679-688).

[4] https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications

[5] Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006). ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57, 45.

[6] Bach, N., & Badaskar, S. (2007). A review of relation extraction. Literature review for Language and Statistics II, 2, 1-15.

[7] Dongmei, L., Yang, Z., Dongyuan, L., & Danqiong, L. (2020). Review of Entity Relation Extraction Methods. Journal of Computer Research and Development, 57(7), 1424.

[8] https://vlsp.org.vn/vlsp2018/eval/ner

[9] https://brat.nlplab.org/installation.html

[10] https://bioc.sourceforge.net/