Named Entity Recognition for Vietnamese Language
1. Introduction
Named entities are phrases that contain the names of persons, organizations, locations, times and quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction and has attracted researchers all over the world since the 1990s.
In 1995, the 6th Message Understanding Conference (MUC-6) began evaluating NER systems for English. NER systems for other languages were later evaluated in the CoNLL shared tasks: Spanish and Dutch in CoNLL 2002, and English and German in CoNLL 2003. In the CoNLL tasks, four types of named entities were considered: names of persons, organizations, locations, and miscellaneous entities that do not belong to the previous three types. More recently, other NER competitions have been organized, e.g. the GermEval 2014 NER Shared Task.
For the Vietnamese language, there has so far been no systematic comparison of the performance of Vietnamese NER systems. The VLSP 2016 campaign therefore aims to provide an objective evaluation of the performance (quality) of NER tools and to encourage the development of NER systems with high accuracy.
2. Task Description
This year's campaign evaluates the ability to recognize three types of NEs: names of persons, organizations, and locations. Recognition of other NE types will be covered in future campaigns.
3. Data
Data are collected from electronic newspapers published on the web. The three NE types are compatible with their descriptions in the CoNLL 2003 Shared Task.
1. Locations:
- roads (streets, motorways)
- trajectories
- regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
- structures (bridges, ports, dams)
- natural locations (mountains, mountain ranges, woods, rivers, wells, fields, valleys, gardens, nature reserves, allotments, beaches, national parks)
- public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theaters, cinemas, galleries, camping grounds, NASA launch pads, clubhouses, universities, libraries, churches, medical centers, parking lots, playgrounds, cemeteries)
- commercial places (chemists, pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues)
- assorted buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)
- abstract “places” (e.g. the free world)
2. Organizations:
- companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
- subdivisions of companies (newsrooms)
- brands
- political movements (political parties, terrorist organizations)
- government bodies (ministries, councils, courts, political unions of countries (e.g. the U.N.))
- publications (magazines, newspapers, journals)
- musical companies (bands, choirs, opera companies, orchestras)
- public organizations (schools, universities, charities)
- other collections of people (sports clubs, sports teams, associations, theater companies, religious orders, youth organizations)
3. Persons:
- first, middle and last names of people, animals and fictional characters, aliases
Examples of data:
• Locations: Thành phố Hồ Chí Minh, Núi Bà Đen, Sông Bạch Đằng.
• Organizations: Công ty Formosa, Nhà máy thủy điện Hòa Bình.
• Persons: the proper name in “ông Lân” (Lân) and in “bà Hà” (Hà).
An entity can contain another entity, e.g. “Uỷ ban nhân dân Thành phố Hà Nội” is an organization that contains the location “Thành phố Hà Nội”.
Training data consist of two datasets. The first dataset contains word segmentation information; POS tags and chunk tags can also be added using available tools. The second dataset is raw text annotated with NE tags only.
4. Data Format
4.1. Dataset1
Data have been preprocessed with word segmentation and POS tagging. The data consist of five columns, with adjacent columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence.
1. The first column is the word
2. The second column is its POS tag
3. The third column is its chunking tag
4. The fourth column is its NE label
5. The fifth column is its nested NE label
NE labels are annotated using the IOB notation as in the CoNLL Shared Tasks. There are 7 labels: B-PER and I-PER are used for persons, B-ORG and I-ORG are used for organizations, B-LOC and I-LOC are used for locations, and O is used for other elements.
An example of a Vietnamese sentence:
Anh | N | B-NP | O | O |
Thanh | Np | I-NP | B-PER | O |
là | V | B-VP | O | O |
cán_bộ | N | B-NP | O | O |
Uỷ_ban | N | B-NP | B-ORG | O |
nhân_dân | N | I-NP | I-ORG | O |
Thành_phố | N | I-NP | I-ORG | B-LOC |
Hà_Nội | Np | I-NP | I-ORG | I-LOC |
. | . | O | O | O |
where {N, Np, V, E, .} are POS tags and {B-NP, I-NP, B-VP, O} are chunking tags.
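The five-column layout described in this section can be read with a few lines of code. The following is a minimal sketch, not an official reader: the names `Token` and `read_sentences` are hypothetical, and the sample is the example sentence written out with columns separated by single spaces, as the format description states.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One line of Dataset1 (hypothetical container, not part of the task)."""
    word: str
    pos: str
    chunk: str
    ne: str
    nested_ne: str


def read_sentences(text: str) -> List[List[Token]]:
    """Split column-formatted text into sentences of five-field tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # an empty line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()             # columns are whitespace-separated
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        current.append(Token(*fields))
    if current:                           # last sentence may lack a blank line
        sentences.append(current)
    return sentences


SAMPLE = """\
Anh N B-NP O O
Thanh Np I-NP B-PER O
là V B-VP O O
cán_bộ N B-NP O O
Uỷ_ban N B-NP B-ORG O
nhân_dân N I-NP I-ORG O
Thành_phố N I-NP I-ORG B-LOC
Hà_Nội Np I-NP I-ORG I-LOC
. . O O O
"""

sentences = read_sentences(SAMPLE)
for token in sentences[0]:
    print(token.word, token.ne, token.nested_ne)
```

Raising an error on a line that does not have exactly five fields is a design choice of this sketch; it makes a word that still contains a space (i.e. one that was not joined with an underscore) visible immediately.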
Notes:
- Because POS tags and chunking tags are determined automatically by public tools, they may contain mistakes.
- For NEs, two main tags, i.e. B-XXX and I-XXX, are used. B-XXX is used for the first word of an NE of type XXX, and I-XXX is used for the other words of that NE. The O label is used for words which do not belong to any NE.
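To make the B-XXX / I-XXX convention concrete, the sketch below groups a sequence of IOB labels into entity spans. The function name `iob_to_spans` is hypothetical, and treating a stray I-XXX (one with no preceding B-XXX of the same type) as the start of a new entity is an assumption of this sketch, not something the task description specifies.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def iob_to_spans(labels: List[str]) -> List[Span]:
    """Group IOB labels of one sentence into (type, start, end) spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, label in enumerate(labels + ["O"]):   # sentinel flushes the last NE
        continues = label.startswith("I-") and etype == label[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))      # current entity ends before i
            start, etype = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            # Assumption: a stray I-XXX opens a new entity instead of failing.
            start, etype = i, label[2:]
    return spans


# NE column and nested NE column of the example sentence in section 4.1.
ne_labels = ["O", "B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
nested_labels = ["O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]

print(iob_to_spans(ne_labels))      # [('PER', 1, 2), ('ORG', 4, 8)]
print(iob_to_spans(nested_labels))  # [('LOC', 6, 8)]
```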
4.2. Dataset2
Data contain only NE information.
An example:
“Anh Thanh là cán bộ Uỷ ban nhân dân Thành phố Hà Nội.”
Anh <ENAMEX TYPE="PERSON"> Thanh </ENAMEX> là cán bộ <ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân <ENAMEX TYPE="LOCATION"> Thành phố Hà Nội </ENAMEX> </ENAMEX> .
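Because ENAMEX tags can be nested, a simple stack is enough to recover the entities from this format. The sketch below is one possible reader, assuming tags are written exactly as in the example (`<ENAMEX TYPE="...">` and `</ENAMEX>`) and that tokens are separated by whitespace; `parse_enamex` and the depth field are hypothetical names introduced for illustration.

```python
import re
from typing import List, Tuple

# Matches an opening tag (capturing its type) or a closing tag.
TAG = re.compile(r'<ENAMEX TYPE="([A-Z]+)">|</ENAMEX>')

Entity = Tuple[str, int, int, int]  # (type, start token, end token excl., depth)


def parse_enamex(text: str) -> Tuple[List[str], List[Entity]]:
    """Return the plain tokens and the (possibly nested) entities."""
    tokens: List[str] = []
    entities: List[Entity] = []
    stack: List[Tuple[str, int]] = []        # open entities: (type, start token)
    pos = 0
    for match in TAG.finditer(text):
        tokens.extend(text[pos:match.start()].split())
        pos = match.end()
        if match.group(1):                   # opening tag
            stack.append((match.group(1), len(tokens)))
        else:                                # closing tag
            if not stack:
                raise ValueError("closing tag without an opening tag")
            etype, start = stack.pop()
            # depth 0 = outermost entity, depth 1 = nested inside another one
            entities.append((etype, start, len(tokens), len(stack)))
    if stack:
        raise ValueError("unclosed ENAMEX tag")
    tokens.extend(text[pos:].split())
    return tokens, entities


tagged = ('Anh <ENAMEX TYPE="PERSON"> Thanh </ENAMEX> là cán bộ '
          '<ENAMEX TYPE="ORGANIZATION"> Uỷ ban nhân dân '
          '<ENAMEX TYPE="LOCATION"> Thành phố Hà Nội </ENAMEX> </ENAMEX> .')

tokens, entities = parse_enamex(tagged)
for etype, start, end, depth in entities:
    print(etype, depth, " ".join(tokens[start:end]))
```

Entities are reported in the order their closing tags appear, so a nested entity is listed before the entity that contains it.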
5. Evaluation methods
The performance of NER systems will be evaluated by the F1 score and accuracy.
5.1. F-measure
F1 = 2 * P * R/(P + R)
where P (precision) and R (recall) are determined as follows:
P = NE-true/NE-sys
R = NE-true/NE-ref
where:
- NE-ref: The number of NEs in gold data
- NE-sys: The number of NEs in the system output
- NE-true: The number of NEs that are correctly recognized by the system
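The three counts translate directly into code. The sketch below assumes that an NE counts as correctly recognized only when both its type and its boundaries match the gold data exactly, as in the CoNLL shared tasks; the text above does not state the matching criterion, so this is an assumption. The name `precision_recall_f1` and the entity tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

# (sentence id, entity type, start token, end token exclusive)
Entity = Tuple[int, str, int, int]


def precision_recall_f1(gold: Iterable[Entity],
                        system: Iterable[Entity]) -> Tuple[float, float, float]:
    """Compute P, R and F1 from gold and system entity sets."""
    gold_set, sys_set = set(gold), set(system)
    ne_ref = len(gold_set)                  # NE-ref
    ne_sys = len(sys_set)                   # NE-sys
    ne_true = len(gold_set & sys_set)       # NE-true (exact match, assumed)
    p = ne_true / ne_sys if ne_sys else 0.0
    r = ne_true / ne_ref if ne_ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Gold: three NEs. System: one exact match and one ORG with a wrong boundary.
gold = [(0, "PER", 1, 2), (0, "ORG", 4, 8), (0, "LOC", 6, 8)]
system = [(0, "PER", 1, 2), (0, "ORG", 4, 6)]

p, r, f1 = precision_recall_f1(gold, system)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")   # P=0.500 R=0.333 F1=0.400
```

In this example NE-true = 1, NE-sys = 2 and NE-ref = 3, so P = 1/2, R = 1/3 and F1 = 2 * (1/2) * (1/3) / (1/2 + 1/3) = 0.4.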
5.2. Accuracy
A = The number of words which are correctly labeled / The total number of words
Accuracy is applied only to output data in the word-segmented column format.
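Word-level accuracy can be sketched as a comparison of two label sequences of equal length. The function name `word_accuracy` is hypothetical, and returning 0.0 for an empty input is a choice made for this sketch only.

```python
from typing import Sequence


def word_accuracy(gold_labels: Sequence[str],
                  system_labels: Sequence[str]) -> float:
    """Fraction of words whose system label equals the gold label."""
    if len(gold_labels) != len(system_labels):
        raise ValueError("gold and system must label the same words")
    if not gold_labels:
        return 0.0                          # convention chosen for this sketch
    correct = sum(g == s for g, s in zip(gold_labels, system_labels))
    return correct / len(gold_labels)


# Gold NE column of the example sentence, and a system output with one error:
# the last word of the organization is labeled O instead of I-ORG.
gold_ne = ["O", "B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
system_ne = ["O", "B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]

print(word_accuracy(gold_ne, system_ne))    # 8 of 9 words are correct
```

The example also shows why accuracy and F1 are reported together: a single wrong label leaves accuracy at 8/9, while under exact matching the whole organization would count as missed.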
System results will be evaluated at both levels of NE labels, i.e. the NE labels and the nested NE labels.
6. References
http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
http://www.clips.uantwerpen.be/conll2002/ner/
http://www.cnts.ua.ac.be/conll2003/ner/
https://sites.google.com/site/germeval2014ner/