VLSP 2021 - Named Entity Recognition for Vietnamese

1. Introduction

Named entities are phrases that contain the names of persons, organizations, and locations, as well as times, quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction and has attracted researchers all over the world since the 1990s.

In 1995, the 6th Message Understanding Conference (MUC-6) started evaluating NER systems for English. Beyond English, NER systems for Spanish and Dutch were evaluated in the CoNLL 2002 Shared Task, and systems for English and German in the CoNLL 2003 Shared Task. In the CoNLL tasks, four named entity types were considered: names of persons, organizations, and locations, plus miscellaneous entities that do not belong to the previous three types.

In addition, NER has attracted the interest of many research groups and is a regular topic at major conferences. In particular, the SHINRA project has attracted many teams to its classification task (http://shinra-project.info/shinra2021ml/). The task is to classify Wikipedia pages in 30 languages into about 220 fine-grained named entity categories, using a large training set (more than 100K pages).

For Vietnamese, the VLSP 2016 and VLSP 2018 campaigns tackled this problem with 3 common NE types and 1 generic type (MISC). This year's competition changes the entity types, as described in sections 3 and 4.

 

2. Important dates

  • October 1, 2021: Training dataset available

  • October 1 - November 8, 2021: Challenge time

  • November 9, 2021: Testing dataset available

  • November 10, 2021: Submit results on testing dataset

  • November 15, 2021: Result notification

  • November 30, 2021: Technical report submission

  • December 10, 2021: Notification of acceptance

  • December 15, 2021: Camera-ready due

  • December 18, 2021: VLSP 2021 Workshop

3. Task Description

The scope of this year's campaign is to assess the ability to recognize entities in several categories (14 main categories, 24 subcategories and 1 generic). The entity types are described in section 4.1 and are inspired by the Microsoft named entity types (https://docs.microsoft.com/en-in/azure/cognitive-services/text-analytics/named-entity-types?tabs=general). This task extends VLSP 2018 by defining more entity types, so that the meaningful entity information in a document can be captured more fully. The larger tag set is both a challenge and an opportunity for the teams competing this year.

4. Data

4.1. Data Types

Data are collected from online newspapers.

The main entity types are briefly described in the following table.

No.  Category      Description                                                                                            Note
1    Person        Names of people.
2    PersonType    Job types or roles held by a person.
3    Location      Natural and human-made landmarks, structures, geographical features, and geopolitical entities.
4    Organization  Companies, political groups, musical bands, sport clubs, government bodies, and public organizations.
5    Event         Historical, social, and naturally occurring events.
6    Product       Physical objects of various categories.
7    Skill         A capability, skill, or expertise.
8    Address       Full mailing addresses.
9    Phone number  Phone numbers.
10   Email         Email addresses.
11   URL           URLs to websites.
12   IP            Network IP addresses.
13   DateTime      Dates and times of day.
14   Quantity      Numerical measurements and units.

Training and development data consist of raw texts enriched with NE tags, organized into folders by domain.

Detailed annotation guidelines: NER Guidelines - 2021 (in Vietnamese).

4.2. Data Format

Data contain only NE information.

An example:

“Anh Thanh là cán bộ Uỷ ban nhân dân Thành phố Hà Nội.”

<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ <ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân <ENAMEX TYPE="LOCATION"> Thành phố Hà Nội </ENAMEX> </ENAMEX> .
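
As the example shows, ENAMEX tags can be nested (a LOCATION inside an ORGANIZATION). The following is a minimal sketch of how such a line might be read, not an official loader: the function name `parse_enamex` and the choice to return `(type, text)` pairs are assumptions made for illustration, and the regular expression assumes tags are written exactly as in the example.

```python
import re

# Matches an opening <ENAMEX TYPE="..."> tag or a closing </ENAMEX> tag.
TAG_RE = re.compile(r'<ENAMEX TYPE="([^"]+)">|</ENAMEX>')


def parse_enamex(annotated):
    """Return (plain_text, entities) for one annotated line.

    Each entity is a (type, text) tuple. An entity is emitted when its
    closing tag is seen, so inner entities come before the outer one.
    """
    stack = []     # open entities: (type, index into `pieces` where it starts)
    pieces = []    # text chunks found between tags, in order
    entities = []
    pos = 0
    for match in TAG_RE.finditer(annotated):
        pieces.append(annotated[pos:match.start()])
        pos = match.end()
        if match.group(1) is not None:       # opening tag
            stack.append((match.group(1), len(pieces)))
        else:                                # closing tag
            if not stack:
                raise ValueError("unbalanced </ENAMEX> tag")
            ne_type, start = stack.pop()
            text = " ".join("".join(pieces[start:]).split())
            entities.append((ne_type, text))
    if stack:
        raise ValueError("unclosed <ENAMEX> tag")
    pieces.append(annotated[pos:])
    plain = " ".join("".join(pieces).split())
    return plain, entities


sample = (
    '<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ '
    '<ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân '
    '<ENAMEX TYPE="LOCATION"> Thành phố Hà Nội </ENAMEX> </ENAMEX> .'
)
plain, entities = parse_enamex(sample)
for ne_type, text in entities:
    print(ne_type, "|", text)
```

A stack is used because a closing tag always belongs to the most recently opened entity, which is what makes the nested ORGANIZATION/LOCATION case work without a full XML parser.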

In addition, the organizers will provide the teams with the data in the column-based format commonly used by NER systems.
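
The exact column layout is not specified here. As an illustration only, the sketch below converts one ENAMEX-annotated line into rows with one BIO tag column per nesting level, outermost entity first. This layout, the whitespace (syllable-level) tokenization, the function name `to_columns`, and the `max_depth` parameter are all assumptions, not the organizers' format.

```python
import re

# Splits a line into tags and text while keeping the tags in the result.
TAG_RE = re.compile(r'(<ENAMEX TYPE="[^"]+">|</ENAMEX>)')
OPEN_RE = re.compile(r'<ENAMEX TYPE="([^"]+)">')


def to_columns(annotated, max_depth=2):
    """Convert one annotated line to rows of (token, tag_level_1, ...).

    Assumed layout (not the official one): whitespace tokens and one BIO
    tag column per nesting level, outermost entity in the first column.
    Entities nested deeper than `max_depth` are ignored.
    """
    rows = []
    stack = []  # each item is [entity_type, first_token_already_seen]
    for part in TAG_RE.split(annotated):
        opening = OPEN_RE.fullmatch(part)
        if opening:
            stack.append([opening.group(1), False])
        elif part == "</ENAMEX>":
            stack.pop()  # raises IndexError on unbalanced input
        else:
            for token in part.split():
                tags = []
                for level in range(max_depth):
                    if level < len(stack):
                        ne_type, started = stack[level]
                        tags.append(("I-" if started else "B-") + ne_type)
                        stack[level][1] = True
                    else:
                        tags.append("O")
                rows.append((token, *tags))
    return rows


sample = (
    '<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ '
    '<ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân '
    '<ENAMEX TYPE="LOCATION"> Thành phố Hà Nội </ENAMEX> </ENAMEX> .'
)
rows = to_columns(sample)
for row in rows:
    print("\t".join(row))
```

With this layout, a token inside the nested location carries an ORGANIZATION tag in the first tag column and a LOCATION tag in the second, so both levels can be recovered from the columns.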

5. Evaluation methods

Evaluation data: The test set is a folder containing input test files without domain information. These files are similar to the training and development files, except that the annotations are removed. Each team should submit its results in the same format as the training and development data.

Result submission: Each team can submit one or several system results in separate folders, numbering the folder names to indicate the priority of each result.

The performance of NER systems will be evaluated by the F1 score for each NE type:

                   F1 = 2 * P * R/(P + R)

where P (precision) and R (recall) are determined as follows:

                   P = NE-true/NE-sys

                   R = NE-true/NE-ref

where:

  • NE-ref: The number of NEs in gold data

  • NE-sys: The number of NEs produced by the recognition system

  • NE-true: The number of NEs that are correctly recognized by the system

An overall F1 score will then be calculated across all labels.
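
The scoring described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the organizers' scorer: an entity is counted as correct only when document, span, and type all match exactly, and the overall score is computed by pooling the counts of all types (micro-averaging). The task text does not spell out either rule. The names `prf` and `evaluate` and the sample offsets are made up for illustration.

```python
from collections import Counter


def prf(n_true, n_sys, n_ref):
    """Precision, recall and F1 from NE-true, NE-sys and NE-ref."""
    p = n_true / n_sys if n_sys else 0.0
    r = n_true / n_ref if n_ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, system):
    """Score each entity type and the pooled total.

    `gold` and `system` are sets of (doc_id, start, end, ne_type) tuples.
    Assumption: an entity is correct only if all four fields match.
    """
    correct = gold & system
    ne_ref = Counter(e[3] for e in gold)
    ne_sys = Counter(e[3] for e in system)
    ne_true = Counter(e[3] for e in correct)
    scores = {
        t: prf(ne_true[t], ne_sys[t], ne_ref[t])
        for t in sorted(set(ne_ref) | set(ne_sys))
    }
    # Assumption: "overall" pools the counts of all types (micro-average).
    scores["overall"] = prf(len(correct), len(system), len(gold))
    return scores


# Made-up character offsets, for illustration only.
gold = {
    ("d1", 0, 9, "PERSON"),
    ("d1", 20, 52, "ORGANIZATION"),
    ("d1", 36, 52, "LOCATION"),
}
system = {
    ("d1", 0, 9, "PERSON"),
    ("d1", 20, 35, "ORGANIZATION"),   # wrong span, so not counted as correct
    ("d1", 36, 52, "LOCATION"),
    ("d1", 60, 65, "LOCATION"),       # spurious entity
}
scores = evaluate(gold, system)
for label, (p, r, f1) in scores.items():
    print(f"{label}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In this toy case NE-true = 2, NE-sys = 4 and NE-ref = 3 overall, so P = 2/4, R = 2/3 and F1 = 4/7 (about 0.571).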

6. References

http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html

http://www.clips.uantwerpen.be/conll2002/ner/

http://www.cnts.ua.ac.be/conll2003/ner/

https://sites.google.com/site/germeval2014ner/
