VLSP 2018 - Named Entity Recognition for Vietnamese (VNER 2018)

1. Introduction

2. Task Description

3. Data

4. Evaluation Methods

5. References

Detailed annotation guidelines (in Vietnamese)

1. Introduction

Named entities are phrases that contain the names of persons, organizations, and locations, as well as expressions of time, quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction, and it has attracted researchers all over the world since the 1990s.

In 1995, the 6th Message Understanding Conference (MUC-6) started evaluating NER systems for English. Besides English, NER systems for Spanish and Dutch were evaluated in the CoNLL 2002 shared task, and for English and German in the CoNLL 2003 shared task. In these evaluations, four entity types were considered: names of persons, organizations, locations, and miscellaneous entities that do not belong to the previous three types. More recently, other NER competitions have been organized, e.g. the GermEval 2014 NER Shared Task.

For Vietnamese, the VLSP 2016 evaluation campaign was the first effort towards a systematic comparison of the performance of Vietnamese NER systems. A training dataset of 16,858 tagged sentences (extracted from online news), containing 14,918 named entities, was produced and made freely available for research purposes.

This second NER shared task for Vietnamese will deal with various kinds of documents.

2. Task Description

The scope of this year's campaign is to evaluate the ability to recognize NEs of three types: names of persons, organizations, and locations. Recognition of other NE types will be covered in future campaigns.

3. Data

3.1. Data Types

Data are collected from electronic newspapers published on the web. The three NE types are compatible with their descriptions in the CoNLL 2003 Shared Task.

1. Locations:

  • roads (streets, motorways)
  • trajectories
  • regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
  • structures (bridges, ports, dams)
  • natural locations (mountains, mountain ranges, woods, rivers, wells, fields, valleys, gardens, nature reserves, allotments, beaches, national parks)
  • public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theaters, cinemas, galleries, camping grounds, NASA launch pads, clubhouses, universities, libraries, churches, medical centers, parking lots, playgrounds, cemeteries)
  • commercial places (chemists, pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues)
  • assorted buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)
  • abstract “places” (e.g. the free world)

2. Organizations:

  • companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
  • subdivisions of companies (newsrooms)
  • brands
  • political movements (political parties, terrorist organizations)
  • government bodies (ministries, councils, courts, political unions of countries (e.g. the U.N.))
  • publications (magazines, newspapers, journals)
  • musical companies (bands, choirs, opera companies, orchestras)
  • public organizations (schools, universities, charities)
  • other collections of people (sports clubs, sports teams, associations, theater companies, religious orders, youth organizations)

3. Persons:

  • first, middle and last names of people, animals and fictional characters, aliases

Examples of data:

  • Locations: Thành phố Hồ Chí Minh, Núi Bà Đen, Sông Bạch Đằng.
  • Organization: Công ty Formosa, Nhà máy thủy điện Hòa Bình.
  • Persons: the proper names in “ông Lân”, “bà Hà”.

An entity can contain another entity, e.g. “Uỷ ban nhân dân Thành phố Hà Nội” is an organization which contains the location “Thành phố Hà Nội”.

Training data consist of two datasets. The first dataset contains word-segmentation information; POS tags and chunk tags can also be added by utilizing available tools. The second dataset is raw data containing only NE tags.

3.2. Data Format

Data contain only NE information.

An example:

“Anh Thanh là cán bộ Uỷ ban nhân dân Thành phố Hà Nội.”

<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ <ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân <ENAMEX TYPE="LOCATION"> thành phố Hà Nội </ENAMEX> </ENAMEX> .
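The nested ENAMEX markup above is well-formed XML once wrapped in a dummy root element, so it can be read with standard XML tools. A minimal sketch in Python (the helper name `extract_entities` is an assumption for illustration, not part of any official tooling):

```python
import xml.etree.ElementTree as ET

def extract_entities(annotated_sentence):
    """Return (TYPE, text) pairs for every ENAMEX element,
    including nested entities, in document order."""
    # Wrap the fragment in a dummy root so it parses as one XML document.
    root = ET.fromstring("<root>" + annotated_sentence + "</root>")
    entities = []
    for elem in root.iter("ENAMEX"):
        # itertext() gathers the text of the element and all its children;
        # split/join normalizes the extra whitespace around the tags.
        text = " ".join("".join(elem.itertext()).split())
        entities.append((elem.get("TYPE"), text))
    return entities

sent = ('<ENAMEX TYPE="PERSON"> Anh Thanh </ENAMEX> là cán bộ '
        '<ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân '
        '<ENAMEX TYPE="LOCATION"> thành phố Hà Nội </ENAMEX> </ENAMEX> .')
print(extract_entities(sent))
```

Running this on the example sentence yields the person “Anh Thanh”, the outer organization, and the nested location as three separate entities.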

4. Evaluation Methods

The performance of NER systems will be evaluated by the F1 score and accuracy.

4.1. F measure

                   F1 = 2 * P * R / (P + R)

      where P (Precision) and R (Recall) are defined as follows:

                   P = NE-true / NE-sys

                   R = NE-true / NE-ref


  • NE-ref: the number of NEs in the gold data
  • NE-sys: the number of NEs produced by the recognition system
  • NE-true: the number of NEs correctly recognized by the system
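The formulas above can be sketched as a small Python helper (the function name and the example counts are assumptions for illustration, not part of the official scorer):

```python
def ner_scores(ne_true, ne_sys, ne_ref):
    """Precision, recall, and F1 from the three counts defined above."""
    p = ne_true / ne_sys if ne_sys else 0.0   # P = NE-true / NE-sys
    r = ne_true / ne_ref if ne_ref else 0.0   # R = NE-true / NE-ref
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# E.g. 80 entities correctly recognized out of 100 predicted,
# against 90 entities in the gold data:
p, r, f1 = ner_scores(ne_true=80, ne_sys=100, ne_ref=90)
print(round(p, 3), round(r, 3), round(f1, 3))
```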

4.2. Accuracy

A = The number of correctly labeled words / The total number of words

Accuracy is applied only to output data that are formatted at the word-segmentation level.

System results will be evaluated at both levels of NE labels (top-level and nested entities).
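Assuming one label per segmented word, the accuracy formula can be sketched as follows (the BIO-style labels in the example are an illustrative assumption; the task data themselves are distributed in ENAMEX format):

```python
def word_accuracy(gold_labels, predicted_labels):
    """Word-level labeling accuracy: correctly labeled words / total words."""
    assert len(gold_labels) == len(predicted_labels), "sequences must align"
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

gold = ["B-PER", "I-PER", "O", "O",     "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
print(word_accuracy(gold, pred))  # 5 of 6 words labeled correctly
```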

5. References





... to be updated ...