VLSP 2018 - Named Entity Recognition for Vietnamese (VNER 2018) | Association for Vietnamese Language and Speech Processing

Detailed annotation guidelines (in Vietnamese)

1. Introduction

Named entities are phrases that contain the names of persons, organizations, and locations, as well as times, quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction and has attracted researchers all over the world since the 1990s.

In 1995, the 6th Message Understanding Conference (MUC-6) began evaluating NER systems for English. Beyond English, NER systems for Spanish and Dutch were evaluated in the CoNLL 2002 Shared Task, and systems for English and German in the CoNLL 2003 Shared Task. These evaluation tasks considered four types of named entities: names of persons, organizations, and locations, plus miscellaneous entities that do not belong to the previous three types. More recent NER competitions include the GermEval 2014 NER Shared Task.

For Vietnamese, the VLSP 2016 evaluation campaign was the first effort towards a systematic comparison of the performance of Vietnamese NER systems. A training dataset of 16,858 tagged sentences (extracted from online news) containing 14,918 named entities was produced and made freely available for research purposes.

This second NER shared task for Vietnamese will deal with various kinds of documents.

2. Task Description

The scope of this year's campaign is to evaluate the ability to recognize NEs of three types: names of persons, organizations, and locations. Recognition of other NE types will be covered in future campaigns.

3. Data

3.1. Data Types

Data are collected from online newspapers. The three NE types follow their descriptions in the CoNLL 2003 Shared Task.

1. Locations:

roads (streets, motorways)
trajectories
regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
structures (bridges, ports, dams)
natural locations (mountains, mountain ranges, woods, rivers, wells, fields, valleys, gardens, nature reserves, allotments, beaches, national parks)
public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theaters, cinemas, galleries, camping grounds, NASA launch pads, clubhouses, universities, libraries, churches, medical centers, parking lots, playgrounds, cemeteries)
commercial places (chemists, pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues)
assorted buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)
abstract “places” (e.g. the free world)

2. Organizations:

companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
subdivisions of companies (newsrooms)
brands
political movements (political parties, terrorist organizations)
government bodies (ministries, councils, courts, political unions of countries (e.g. the U.N.))
publications (magazines, newspapers, journals)
musical companies (bands, choirs, opera companies, orchestras)
public organizations (schools, universities, charities)
other collections of people (sports clubs, sports teams, associations, theater companies, religious orders, youth organizations)

3. Persons:

first, middle and last names of people, animals and fictional characters, aliases

Examples of data:

Locations: Thành phố Hồ Chí Minh, Núi Bà Đen, Sông Bạch Đằng.
Organizations: Công ty Formosa, Nhà máy thủy điện Hòa Bình.
Persons: the proper names in “ông Lân”, “bà Hà”.

An entity can contain another entity, e.g. “Uỷ ban nhân dân Thành phố Hà Nội” is an organization that contains the location “thành phố Hà Nội”.

Training and development data consist of raw texts enriched with NE tags, organized into folders corresponding to different domains.

3.2. Data Format

Data contain only NE information.

An example:

“Anh Thanh là cán bộ Uỷ ban nhân dân Thành phố Hà Nội.”

<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ <ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân <ENAMEX TYPE="LOCATION"> thành phố Hà Nội </ENAMEX> </ENAMEX> .
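The nested tagging shown above can be read with a small stack-based parser. The following is only a minimal sketch, not an official tool of the campaign: the names `Entity` and `parse_enamex` are hypothetical, and it assumes that tags always have exactly the form `<ENAMEX TYPE="...">` and `</ENAMEX>` and are properly nested.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    ne_type: str  # value of the TYPE attribute, e.g. "PERSON"
    text: str     # entity text with inner tags removed
    level: int    # 1 = outermost, 2 = nested inside one other entity, ...


# Matches an opening tag (capturing its TYPE) or a closing tag.
# Assumption: tags are written exactly as in the example above.
TAG = re.compile(r'<ENAMEX TYPE="([^"]+)">|</ENAMEX>')


def parse_enamex(tagged: str) -> List[Entity]:
    """Extract (possibly nested) entities from an ENAMEX-tagged string."""
    entities: List[Entity] = []
    stack = []  # currently open entities as [ne_type, list of text pieces]
    pos = 0
    for match in TAG.finditer(tagged):
        chunk = tagged[pos:match.start()]
        # Plain text belongs to every entity that is currently open.
        for open_entity in stack:
            open_entity[1].append(chunk)
        pos = match.end()
        if match.group(1) is not None:  # opening tag
            stack.append([match.group(1), []])
        else:                           # closing tag
            if not stack:
                raise ValueError("closing tag without a matching opening tag")
            ne_type, pieces = stack.pop()
            # Collapse the padding spaces placed around the tags.
            text = " ".join("".join(pieces).split())
            entities.append(Entity(ne_type, text, len(stack) + 1))
    if stack:
        raise ValueError("unclosed ENAMEX tag")
    return entities


sample = (
    '<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ '
    '<ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân '
    '<ENAMEX TYPE="LOCATION"> thành phố Hà Nội </ENAMEX> </ENAMEX> .'
)
for entity in parse_enamex(sample):
    print(entity.level, entity.ne_type, entity.text)
```

Entities are emitted when their closing tag is reached, so the nested location is listed before the organization that contains it. The `level` field makes it possible to separate outermost entities from nested ones.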

4. Evaluation methods

Evaluation data: The test set is a folder containing input test files without domain information. These files are similar to the training and development files, except that the annotations have been removed. Each team should submit its results in the same format as the training and development data.

Result submission: Each team can submit one or several system results in separate folders, with the folder names numbered to indicate the priority of the results.

The performance of NER systems will be evaluated by the F1 score:

F1 = 2 * P * R/(P + R)

where P (Precision) and R (Recall) are determined as follows:

P = NE-true/NE-sys

R = NE-true/NE-ref

where: