VLSP 2025 challenge on Temporal QA
Important Dates
- June 23, 2025: Registration open
- July 6, 2025: Training data release
- July 15, 2025: Public test release
- August 20, 2025: System submission deadline
- August 30, 2025: Private test results release
- September 10, 2025: Technical report submission
- September 27, 2025: Notification of acceptance
- October 3, 2025: Camera-ready deadline
- October 29-30, 2025: Conference dates
Task Description
Objective: Build a system to answer temporal questions in Vietnamese across two sub-tasks: Date Arithmetic (date-arith) and Duration Question Answering (durationQA). The system must extract and reason about temporal information to provide accurate answers related to dates, durations, and temporal relationships.
- Sub-Task 1: Date Arithmetic (date-arith)
Description: The date-arith sub-task focuses on handling questions related to date calculations, such as adding or subtracting time intervals from a given date. This involves understanding and manipulating time expressions to compute answers based on the provided context.
Focus: Parse and manipulate temporal expressions to compute new dates.
- Sub-Task 2: Duration Question Answering (durationQA)
Description: Answer questions about the duration of events or actions based on a given context. The system must extract duration-related information from text and use real-world knowledge to evaluate answer options, determining how long an event or action lasts.
Focus: Identify explicit or implied durations in the context (e.g., "6 years") and apply real-world reasoning to classify options as correct ("yes") or incorrect ("no") based on factual accuracy.
Dataset
We constructed the Vietnamese datasets for both sub-tasks through the following process:
- Translation: Translated the original English datasets from TimeBench into Vietnamese. For Sub-task 1, the source data contains only a single question pattern; for Sub-task 2, the structure and intent of each example were carefully preserved.
- Data Generation: Extended the Vietnamese datasets using two approaches: rule-based generation (covering five distinct question patterns for Sub-task 1) and GPT-based generation (for Sub-task 2), ensuring consistency with the original formats and semantics.
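For illustration, the sketch below shows what rule-based generation for Sub-task 1 might look like. The single template and all helper names are our own assumptions; the organizers' generator covers five question patterns and is not reproduced here.

import random

# Hypothetical template mirroring the question pattern shown in the examples
# below; the organizers' five actual patterns are not reproduced here.
TEMPLATE = "Thời gian {y} năm và {m} tháng trước tháng {month}, {year} là khi nào?"

def shift_months(year, month, delta):
    # Shift a (year, month) pair by delta months (negative = into the past).
    idx = year * 12 + (month - 1) + delta
    return idx // 12, idx % 12 + 1

def generate_example(rng):
    year, month = rng.randint(1000, 2100), rng.randint(1, 12)
    y, m = rng.randint(1, 10), rng.randint(1, 11)
    ans_year, ans_month = shift_months(year, month, -(y * 12 + m))
    return {
        "question": TEMPLATE.format(y=y, m=m, month=month, year=year),
        "answer": [f"Tháng {ans_month}, {ans_year}"],
        "context": "",
    }

print(generate_example(random.Random(0)))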
Data Format
The dataset is stored as JSON files in the dataset folder. Each entry follows the format described below for the two sub-tasks:
Sub-Task 1: Date Arithmetic (date-arith)
- Input:
- A question requiring date calculation.
- Contains a temporal reference (date/time) and an operation (e.g., add/subtract a time interval).
- Output:
- A new date/time after performing the calculation (e.g., "Tháng 5, 2025").
Example:
{
  "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
  "answer": ["Tháng 4, 1296"],
  "context": ""
}
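As a sketch of the reasoning this sub-task requires, the snippet below parses the question pattern of the example above and computes the shifted date. The regular expression is tailored to this one pattern and is an assumption; the actual data covers more patterns.

import re

# Regex for questions of the form "X năm và Y tháng trước tháng M, YYYY";
# it matches only the pattern of the example above (an assumption).
PATTERN = re.compile(r"(?:(\d+) năm)?(?: và )?(?:(\d+) tháng)? trước tháng (\d+), (\d+)")

def answer(question):
    m = PATTERN.search(question)
    years, months = int(m.group(1) or 0), int(m.group(2) or 0)
    month, year = int(m.group(3)), int(m.group(4))
    idx = year * 12 + (month - 1) - (years * 12 + months)  # months since year 0
    return f"Tháng {idx % 12 + 1}, {idx // 12}"

print(answer("Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?"))
# -> Tháng 4, 1296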
Sub-Task 2: Duration Question Answering (durationQA)
- Input:
- A context (text containing temporal information).
- A question about the duration of an event/action.
- A set of answer options.
- Output: A list of labels corresponding to each answer option:
- "yes" if the option correctly describes the duration.
- "no" if incorrect.
Example:
{
  "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
  "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
  "qid": 54,
  "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
  "labels": ["yes", "no", "yes", "yes"]
}
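To make the labeling task concrete, here is a naive baseline that converts each Vietnamese duration option to minutes and marks it "yes" if it falls inside a plausible window for the event. The unit table and the window for the bicycle-repair example are assumptions for illustration; a competitive system would reason over the context rather than use fixed thresholds.

# Minutes per Vietnamese time unit (an illustrative table, not challenge data).
UNIT_MINUTES = {"phút": 1, "giờ": 60, "ngày": 1440, "tuần": 10080,
                "tháng": 43200, "năm": 525600}

def to_minutes(option):
    value, unit = option.split(maxsplit=1)
    return float(value) * UNIT_MINUTES[unit]

def label_options(options, lo, hi):
    # "yes" if the option's duration falls within [lo, hi] minutes, else "no".
    return ["yes" if lo <= to_minutes(o) <= hi else "no" for o in options]

# Bicycle repair: assume anything from 10 minutes to one day is plausible.
print(label_options(["30 phút", "1 tháng", "10 phút", "2 giờ"], 10, 1440))
# -> ['yes', 'no', 'yes', 'yes']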
Evaluation
System performance will be evaluated using a range of standard metrics, including Accuracy, Exact Match, Precision, Recall, and F1-score.
Evaluation Metrics
- Accuracy: Used for Sub-Task 1 (Date Arithmetic). It is the percentage of system answers that exactly match the ground-truth answers.
- Exact Match: Used for Sub-Task 2 (DurationQA). It evaluates whether the predicted label sequence matches exactly the ground-truth label sequence.
- Precision: Ratio of correctly predicted "yes" answers to total "yes" predictions made by the system.
- Recall: Ratio of correctly predicted "yes" answers to total actual "yes" answers in the ground truth.
- F1-score: Harmonic mean of Precision and Recall, summarizing overall performance.
Evaluation is performed separately for each sub-task. The final evaluation report includes individual scores as well as aggregate performance across all tasks.
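The sketch below shows one way these metrics could be computed from predictions and gold labels; the function names are ours, and "yes" is treated as the positive class, matching the definitions above.

def accuracy(preds, golds):
    # Sub-Task 1: fraction of answers that exactly match the ground truth.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def duration_metrics(pred, gold):
    # Sub-Task 2: exact match over the label sequence plus "yes"-class P/R/F1.
    tp = sum(p == g == "yes" for p, g in zip(pred, gold))
    pred_yes, gold_yes = pred.count("yes"), gold.count("yes")
    precision = tp / pred_yes if pred_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"exact_match": float(pred == gold),
            "precision": precision, "recall": recall, "f1": f1}

# The worked Sub-Task 2 example below:
print(duration_metrics(["yes", "no", "no", "yes"], ["yes", "no", "yes", "yes"]))
# -> exact_match 0.0, precision 1.0, recall ≈ 0.6667, f1 ≈ 0.8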
Example for Sub-Task 1: Date Arithmetic
Input:
{
  "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
  "context": "",
  "answer": ["Tháng 4, 1296"]
}
System Prediction:
["Tháng 4, 1296"]
- Accuracy: The prediction matches the ground-truth exactly.
→ Accuracy = 1.0
Example for Sub-Task 2: Duration Question Answering
Input:
{
  "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
  "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
  "qid": 54,
  "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
  "labels": ["yes", "no", "yes", "yes"]
}
System Prediction:
["yes", "no", "no", "yes"]
Metric Calculation:
- Exact Match: System prediction ≠ ground truth.
→ Exact Match = 0.0
- Precision: 2 correct "yes" predictions out of 2 total "yes" predictions.
→ Precision = 2 / 2 = 1.0
- Recall: 2 correct "yes" predictions out of 3 actual "yes" answers in the ground truth.
→ Recall = 2 / 3 ≈ 0.6667
- F1-score: Harmonic mean of Precision and Recall.
→ F1 = 2 × (1.0 × 0.6667) / (1.0 + 0.6667) ≈ 0.8
Training and Test Data
The organizers will provide:
- Training and Testing Datasets: Vietnamese sentences for the two sub-tasks, formatted in JSON.
- Annotations: Labeled temporal entities, questions, and answers.
- Structure: Contexts, questions, answer options (for durationQA), and ground-truth answers.
Submission
Submission instructions will be announced later.
Registration
Contact
Zalo Group: https://zalo.me/g/neggkh158
Organizers
- Nguyen Thi Minh Huyen, email: ntmhuyen@gmail.com
- Ha My Linh, email: halinh.hus@gmail.com
- Pham Thi Duc, email: phamthiduc@hus.edu.vn
- Ngo The Quyen, email: ngoquyenbg@vnu.edu.vn
- Le Ngoc Toan, email: lengoctoan@hus.edu.vn
References
- Chu, Zheng, et al. "TimeBench: A comprehensive evaluation of temporal reasoning abilities in large language models." arXiv preprint arXiv:2311.17667 (2023).
- Tan, Qingyu, Hwee Tou Ng, and Lidong Bing. "Towards benchmarking and improving the temporal reasoning capability of large language models." arXiv preprint arXiv:2306.08952 (2023).
- Virgo, Felix, Fei Cheng, and Sadao Kurohashi. "Improving event duration question answering by leveraging existing temporal information extraction data." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.
Sponsors and Partners