
Association for Vietnamese Language and Speech Processing

A chapter of VAIP - Vietnam Association for Information Processing

VLSP 2025 challenge on Temporal QA

Important dates 
  • June 23, 2025: Registration open
  • July 6, 2025: Training data release
  • July 15, 2025: Public test release
  • August 20, 2025: System submission deadline
  • August 30, 2025: Private test results release
  • September 10, 2025: Technical report submission
  • September 27, 2025: Notification of acceptance
  • October 3, 2025: Camera-ready deadline
  • October 29-30, 2025: Conference dates
 
Task Description

Objective: Build a system that answers temporal questions in Vietnamese across two sub-tasks: Date Arithmetic (date-arith) and Duration Question Answering (durationQA). The system must extract and reason about temporal information to provide accurate answers related to dates, durations, and temporal relationships.

  • Sub-Task 1: Date Arithmetic (date-arith)
    Description: The date-arith sub-task focuses on handling questions related to date calculations, such as adding or subtracting time intervals from a given date. This involves understanding and manipulating time expressions to compute answers based on the provided context.
    Focus: Parse and manipulate temporal expressions to compute new dates.
  • Sub-Task 2: Duration Question Answering (durationQA)
    Description: Answer questions about the duration of events or actions based on a given context. The system must extract duration-related information from text and use real-world knowledge to evaluate answer options, determining how long an event or action lasts.
    Focus: Identify explicit or implied durations in the context (e.g., "6 years") and apply real-world reasoning to classify options as correct ("yes") or incorrect ("no") based on factual accuracy.
 
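The date calculations in Sub-Task 1 reduce to month arithmetic on a (year, month) pair. A minimal stdlib-only sketch (the function name and decomposition are ours, not part of the task specification):

```python
def shift_months(year: int, month: int, delta_months: int) -> tuple:
    """Shift a (year, month) pair by delta_months (negative = earlier).

    Months are mapped to a single integer count so that year borrows
    and carries are handled by integer division.
    """
    total = year * 12 + (month - 1) + delta_months
    return total // 12, total % 12 + 1

# "1 năm và 2 tháng trước tháng 6, 1297" -> shift 14 months earlier
year, month = shift_months(1297, 6, -(12 + 2))
print(f"Tháng {month}, {year}")  # Tháng 4, 1296
```

A full system would additionally need to parse the Vietnamese question into the reference date and the signed offset, which this sketch takes as given.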
Dataset

We constructed the Vietnamese datasets for both sub-tasks through the following process:

  • Translation: Translated the original English datasets from TimeBench into Vietnamese. For Sub-task 1, the source data contains only a single question pattern; for Sub-task 2, the structure and intent of each example were carefully preserved.
  • Data Generation: Extended the Vietnamese datasets using two approaches: rule-based generation (covering five distinct question patterns for Sub-task 1) and GPT-based generation (for Sub-task 2), ensuring consistency with the original formats and semantics.
Data Format

The dataset is stored as JSON files in the folder dataset. Each entry follows the format described below for the two sub-tasks:

  • Sub-Task 1: Date Arithmetic (date-arith)

    • Input:
      • A question requiring date calculation.
      • Contains a temporal reference (date/time) and an operation (e.g., add/subtract a time interval).
    • Output:
      • A new date/time after performing the calculation (e.g., "Tháng 5, 2025").

    Example:

    {
        "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
        "answer": ["Tháng 4, 1296"],
        "context": ""
    }

  • Sub-Task 2: Duration Question Answering (durationQA)

    • Input:
      • A context (text containing temporal information).
      • A question about the duration of an event/action.
      • A set of answer options.
    • Output: A list of labels corresponding to each answer option:
      • "yes" if the option correctly describes the duration.
      • "no" if incorrect.

    Example:

    {
        "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
        "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
        "qid": 54,
        "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
        "labels": ["yes", "no", "yes", "yes"]
    }
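Since each durationQA entry pairs every option with one label, a loader can validate that invariant when reading the JSON. A minimal sketch using the field names from the examples above (the inline JSON string stands in for a file from the dataset folder, whose exact layout is not specified here):

```python
import json

# One durationQA entry, as shown in the data-format example above.
raw = """{
  "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
  "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
  "qid": 54,
  "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
  "labels": ["yes", "no", "yes", "yes"]
}"""

entry = json.loads(raw)
# Every option must have exactly one corresponding yes/no label.
assert len(entry["options"]) == len(entry["labels"])
pairs = list(zip(entry["options"], entry["labels"]))
```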

 
Evaluation

System performance will be evaluated using standard metrics: Accuracy, Exact Match, Precision, Recall, and F1-score.

Evaluation Metrics

  • Accuracy: Used for Sub-Task 1 (Date Arithmetic). It is the percentage of system answers that exactly match the ground-truth answers.
  • Exact Match: Used for Sub-Task 2 (DurationQA). It evaluates whether the predicted label sequence matches exactly the ground-truth label sequence.
  • Precision: Ratio of correctly predicted "yes" answers to total "yes" predictions made by the system.
  • Recall: Ratio of correctly predicted "yes" answers to total actual "yes" answers in the ground truth.
  • F1-score: Harmonic mean of Precision and Recall, summarizing overall performance.

Evaluation is performed separately for each sub-task. The final evaluation report includes individual scores as well as aggregate performance across all tasks.

Example for Sub-Task 1: Date Arithmetic

Input:

{
    "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
    "context": "",
    "answer": ["Tháng 4, 1296"]
}

System Prediction:

["Tháng 4, 1296"]

  • Accuracy: The prediction matches the ground-truth exactly.
    Accuracy = 1.0

 

Example for Sub-Task 2: Duration Question Answering

Input:

{
    "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
    "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
    "qid": 54,
    "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
    "labels": ["yes", "no", "yes", "yes"]
}

System Prediction:

["yes", "no", "no", "yes"]

Metric Calculation:
  • Exact Match: System prediction ≠ ground truth.
    Exact Match = 0.0
  • Precision: 2 correct "yes" predictions out of 2 total "yes" predictions.
    Precision = 2 / 2 = 1.0
  • Recall: 2 correct "yes" predictions out of 3 actual "yes" in ground truth.
    Recall = 2 / 3 ≈ 0.6667
  • F1-score: Harmonic mean of precision and recall.
    F1 = 2 × (1.0 × 0.6667) / (1.0 + 0.6667) ≈ 0.8
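The Sub-Task 2 metrics above can be computed in a few lines. A sketch of the calculation as defined in this section (the function name is ours; "yes" is treated as the positive class):

```python
def duration_qa_metrics(pred, gold):
    """Exact match plus precision/recall/F1 over "yes" labels."""
    em = 1.0 if pred == gold else 0.0
    # True positives: positions where both prediction and gold are "yes".
    tp = sum(p == g == "yes" for p, g in zip(pred, gold))
    pred_yes = pred.count("yes")
    gold_yes = gold.count("yes")
    precision = tp / pred_yes if pred_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return em, precision, recall, f1

em, p, r, f1 = duration_qa_metrics(
    ["yes", "no", "no", "yes"],   # system prediction
    ["yes", "no", "yes", "yes"],  # ground truth
)
# em = 0.0, p = 1.0, r = 2/3, f1 = 0.8 (matching the worked example)
```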
 
Training and Test Data

The organizers will provide:

  • Training and Testing Datasets: Vietnamese sentences for the two sub-tasks, formatted in JSON.
  • Annotations: Labeled temporal entities, questions, and answers.
  • Structure: Contexts, questions, answer options (for durationQA), and ground-truth answers.
 
Submission

Submission instructions will be announced later.

Registration

Click here to register 

Contact

Zalo Group: https://zalo.me/g/neggkh158

Organizers
  • Nguyen Thi Minh Huyen, email: ntmhuyen@gmail.com
  • Ha My Linh, email: halinh.hus@gmail.com
  • Pham Thi Duc, email: phamthiduc@hus.edu.vn
  • Ngo The Quyen, email: ngoquyenbg@vnu.edu.vn
  • Le Ngoc Toan, email: lengoctoan@hus.edu.vn
References
  1. Chu, Zheng, et al. "Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models." arXiv preprint arXiv:2311.17667 (2023).
  2. Tan, Qingyu, Hwee Tou Ng, and Lidong Bing. "Towards benchmarking and improving the temporal reasoning capability of large language models." arXiv preprint arXiv:2306.08952 (2023).
  3. Virgo, Felix, Fei Cheng, and Sadao Kurohashi. "Improving event duration question answering by leveraging existing temporal information extraction data." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.

Sponsors and Partners

VinBIGDATA  VinIF  AIMESOFT  bee  Dagoras  zalo  VTCC  VCCorp

IOIT  HUS  USTH  UET  TLU  UIT  INT2  jaist  VIETLEX