VLSP 2025 challenge on Numerical Reasoning QA

Important dates

June 23, 2025: Registration open

July 10, 2025: Training data, public test release

August 20, 2025: Private test release

August 20, 2025: System submission deadline

August 30, 2025: Private test results release

September 10, 2025: Technical report submission

September 27, 2025: Notification of acceptance

October 3, 2025: Camera-ready deadline

October 29-30, 2025: Conference dates

Task Description

This shared task aims to develop specialized models that not only provide correct answers to financial questions but also generate transparent reasoning paths, ensuring trust and understanding in the results. By fostering innovation in this area, the task will contribute to financial literacy and technological advancement in Vietnam.

Data and Resources Provided

Training Data

Translated FinQA dataset: The original FinQA dataset (Chen et al., 2021) has been preprocessed and translated into Vietnamese to serve as a primary resource for financial question answering tasks.
Vietnamese financial reports: A collection of financial data extracted from publicly available Vietnamese financial reports, spanning the period from 2020 to 2025.
Additional datasets: Participants are encouraged to enhance their training data with other publicly available or appropriately licensed Vietnamese financial-domain datasets to improve model robustness and performance.

Evaluation Data

Public evaluation dataset: Released for initial model validation and development
Private evaluation dataset: Used for final ranking and evaluation (held by organizers)

Dataset format

The dataset is stored as JSON files in the "dataset" folder. Each entry has the following format:

"pre_text": the text before the table, as a list of sentences;
"post_text": the text after the table, as a list of sentences;
"table": the table, as a list of rows with the header row first, each row being a list of cell strings;
"id": unique example id, composed of the original report name and the example index for the question;

"qa": {
"question": the question;
"program": the reasoning program;
"exe_ans": the gold execution result;
}

Example:

{
"pre_text": [
"mục đích phát triển dự án bất động sản của doanh nghiệp .",
"hệ số nợ trên vốn chủ sở hữu ( d / e ) của công ty cổ phần phát triển đô thị từ liêm là 0.1 lần và thường xuyên ở mức thấp , tránh rủi ro thanh khoản cho doanh nghiệp .",
"chúng tôi nhận định triển vọng thời gian tới của công ty cổ phần phát triển đô thị từ liêm khả quan nhờ vào : dự án bãi muối sẽ tiếp tục mang lại doanh thu lớn cho công ty cổ phần phát triển đô thị từ liêm khi bàn giao mạnh trong năm 2024 ( hiện còn 516 tỷ đồng giá trị hàng tồn kho ) , đồng thời do doanh nghiệp đã hoàn thành nghĩa vụ tài chính từ lâu và cơ bản hoàn thiện hạ tầng cơ bản , do đó sẽ có biên lợi nhuận lớn khi đưa vào kinh doanh xét đến giá đất trong khu vực đã ghi nhận mức tăng đáng kể trong các năm qua .",
"hiệu quả dòng tiền dự báo ở mức cao do đây là một trong những dự án hiếm hoi tại thành phố hạ long được kinh doanh theo hình thức chuyển nhượng đất nền ( triển khai theo quy định pháp lý cũ ) .",
"các dự án khác của công ty cổ phần phát triển đô thị từ liêm đang được triển khai .",
"ngoài dự án bãi muối , công ty cổ phần phát triển đô thị từ liêm hiện còn tồn kho lớn nhất tại dự án dịch vọng ( cầu giấy , hà nội ) với hơn 395 tỷ đồng .",
"dự án đang vướng về giải phóng mặt bằng , công ty đang tiếp tục thỏa thuận và xin gia hạn dự án , trong năm 2023 giá trị hàng tồn kho của dự án đã tăng hơn 100 tỷ đồng có thể là tín hiệu tốt về khả năng triển khai .",
"tình hình tài chính tốt với nợ vay thấp , tỷ lệ nợ trên vốn chủ sở hữu ( d / e ) cũng duy trì ở mức thấp trong nhiều năm và chiến lược kinh doanh tương đối thận trọng , chỉ tiến hành mở bán dự án khi hoàn tất pháp lý và điều kiện mở bán , qua đó giúp doanh nghiệp ít chịu rủi ro liên quan đến pháp lý dự án ."
],
"post_text": [
"."
],
"table": [
[
"chỉ tiêu ( tỷ đồng )","quý 4 năm 2022","quý 4 năm 2023","tỷ suất tăng trưởng quý 4 năm 2023 so với quý 4 năm 2022","2022","2023","tỷ suất tăng trưởng năm 2023 so với năm 2022"
],
[
"doanh thu thuần","116","747","542.4%","391","914","133.6%"
],
[
"lợi nhuận gộp","12","488","3840.9%","163","513","214.6%"
],
[
"biên lợi nhuận gộp ( % )","10.7%","65.4%","","41.7%","56.2%",""
],
[
"chi phí bán hàng & quản lí doanh nghiệp","-9","-11","16.3%","-35","-31","-11.2%"
],
[
"tỷ lệ chi phí bán hàng & quản lí doanh nghiệp / doanh thu thuần ( % )","-7.8%","-1.4%","","-8.9%","-3.4%",""
],
[
"doanh thu tài chính","2","2","-28.8%","6","2","-62.7%"
],
[
"chi phí tài chính","0","-6","912.5%","0","-8","2568.7%"
],
[
"chi phí lãi vay","","-6","","","-9",""
],
[
"lợi nhuận thuần từ hoạt động kinh doanh","6","473","8343.1%","134","477","255.2%"
],
[
"lợi nhuận trước thuế","5","458","8773.5%","134","463","246.6%"
],
[
"lợi nhuận sau thuế sau lợi ích cổ đông thiểu số","4","363","8885.7%","107","367","244.4%"
],
[
"biên lợi nhuận ròng ( % )","3.5%","48.6%","","27.2%","40.1%",""
]
],
"qa": {
"question": "doanh thu thuần năm 2023 gấp bao nhiêu lần doanh thu thuần năm 2022?",
"program": "divide(914, 391)",
"exe_ans": 2.337595907928389
}
}

In the private test data, only the "question" field is given; no reference is provided.
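As a minimal sketch of how an entry in this format might be read, the Python snippet below indexes the "table" field by row label and column header. The helper name `table_to_lookup` is hypothetical and the sample entry is a shortened copy of the example above (only two value columns kept); neither is part of any official tooling.

```python
import json

def table_to_lookup(table):
    """Map each row label (first cell) to a {column header: cell} dict.

    Assumes the first row is the header and the first column holds row labels,
    as in the example entry.
    """
    header = table[0]
    lookup = {}
    for row in table[1:]:
        lookup[row[0]] = dict(zip(header[1:], row[1:]))
    return lookup

# Shortened sample entry (illustrative subset of the example above).
entry = json.loads("""
{
  "table": [
    ["chỉ tiêu ( tỷ đồng )", "2022", "2023"],
    ["doanh thu thuần", "391", "914"],
    ["lợi nhuận gộp", "163", "513"]
  ],
  "qa": {
    "question": "doanh thu thuần năm 2023 gấp bao nhiêu lần doanh thu thuần năm 2022?"
  }
}
""")

cells = table_to_lookup(entry["table"])
revenue_2022 = float(cells["doanh thu thuần"]["2022"])
revenue_2023 = float(cells["doanh thu thuần"]["2023"])
ratio = revenue_2023 / revenue_2022  # same computation as divide(914, 391)
print(ratio)
```

Cells are kept as strings because some of them carry a percent sign or are empty; converting them to numbers is left to the caller.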

Evaluation

Previous studies on numerical reasoning in QA tasks have typically focused solely on execution accuracy, which evaluates whether the final answer produced by the generated program matches the gold answer. Notable examples include DROP (Dua et al., 2019) and MathQA (Amini et al., 2019). However, in the financial domain, where explainability and transparency are critical, this metric alone is insufficient.
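Execution accuracy can be illustrated with a small interpreter for programs such as divide(914, 391). The sketch below is only an assumption about how such a check might look: the names `execute_program` and `execution_match`, the operation set (add, subtract, multiply, divide), the handling of `#k` as a reference to the result of step k, and the tolerance are all illustrative, and the organizers' scorer may differ.

```python
import re

# Assumed operation set; the task description only shows add, subtract and divide.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# One step looks like "op(arg1, arg2)"; steps are separated by commas.
STEP = re.compile(r"(\w+)\(([^()]*)\)")

def execute_program(program):
    """Run the steps in order and return the result of the last one."""
    results = []
    for op, arg_text in STEP.findall(program):
        args = []
        for token in arg_text.split(","):
            token = token.strip()
            if token.startswith("#"):
                # "#k" refers to the result of step k (0-based).
                args.append(results[int(token[1:])])
            else:
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]

def execution_match(program, gold, tol=1e-6):
    """Compare the executed result with the gold answer, up to a relative tolerance."""
    return abs(execute_program(program) - gold) <= tol * max(1.0, abs(gold))

print(execution_match("divide(914, 391)", 2.337595907928389))
```

This sketch does not handle arguments written as percentages or table operations; a fuller implementation would need to normalize such tokens first.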

In this shared task, in addition to evaluating execution accuracy, we also provide gold programs for the dataset and introduce an additional evaluation metric: program accuracy. This metric assesses whether the predicted reasoning program is logically and mathematically equivalent to the reference program, regardless of the specific numeric values used.

To do this, we replace all arguments in both programs with symbolic variables and compare their structures. Two programs are considered equivalent if they perform the same operations on the same arguments, even if the order of independent steps or of the arguments of commutative operations differs. For example, the following two programs are mathematically equivalent:
add(a1, a2), add(a3, a4), subtract(#0, #1)
add(a4, a3), add(a1, a2), subtract(#1, #0)
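One way to approximate this comparison is to expand each program into an expression tree and sort the arguments of commutative operations before comparing. The Python sketch below is an illustration under stated assumptions, not the official scorer: the function names are hypothetical, add and multiply are assumed to be the commutative operations, and identical argument strings in the gold program are assumed to map to the same symbol.

```python
import re

STEP = re.compile(r"(\w+)\(([^()]*)\)")
COMMUTATIVE = {"add", "multiply"}  # assumed set of commutative operations

def parse(program):
    """Split "op(a, b), op(#0, c)" into [(op, [args...]), ...]."""
    return [(op, [a.strip() for a in args.split(",")])
            for op, args in STEP.findall(program)]

def canonical(program, symbols):
    """Expand step references into a nested tuple; None if the program is invalid."""
    trees = []
    for op, args in parse(program):
        children = []
        for arg in args:
            if arg.startswith("#"):
                index = int(arg[1:])
                if index >= len(trees):
                    return None  # reference to a step that does not exist yet
                children.append(trees[index])
            elif arg in symbols:
                children.append(symbols[arg])
            else:
                return None  # argument never used by the gold program
        if op in COMMUTATIVE:
            children.sort(key=repr)  # argument order is irrelevant here
        trees.append((op, tuple(children)))
    return trees[-1] if trees else None

def programs_equivalent(gold, predicted):
    # Replace each distinct gold argument with a symbolic variable a0, a1, ...
    symbols = {}
    for _, args in parse(gold):
        for arg in args:
            if not arg.startswith("#"):
                symbols.setdefault(arg, f"a{len(symbols)}")
    gold_tree = canonical(gold, symbols)
    return gold_tree is not None and gold_tree == canonical(predicted, symbols)

print(programs_equivalent(
    "add(a1, a2), add(a3, a4), subtract(#0, #1)",
    "add(a4, a3), add(a1, a2), subtract(#1, #0)",
))
```

A purely structural comparison like this will not recognize deeper algebraic identities, for example a - (b + c) versus (a - b) - c; a computer algebra system would be needed for that.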

Evaluation Criteria

Prepare your prediction file as a list of dictionaries, each containing two fields: the example id and the predicted program. The predicted program is a list of program tokens with 'EOF' as the last token. For example:
[
{ "id": "ETR/2016/page_23.pdf-2",
"predicted": [ "subtract(", "5829", "5735", ")", "EOF" ] },
{ "id": "INTC/2015/page_41.pdf-4",
"predicted": [ "divide(", "8.1", "56.0", ")", "EOF" ] },
... ]
Name the file as "predictions.json" and zip it.
The final submission format is:
results.zip
• results.json
Note that to create a valid submission, zip all the files with 'zip -r zipfilename *' run from inside this directory. DO NOT zip the directory itself, just its contents. THIS IS VERY IMPORTANT.
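The prediction format above can be produced programmatically. The following Python sketch converts program strings into token lists and places the JSON file at the top level of a zip archive, so that the directory itself is not included. The function names are hypothetical, the token layout for multi-step programs is an assumption extrapolated from the single-step examples, and the JSON file name is passed in as a parameter so that it can follow whatever name the organizers require.

```python
import io
import json
import re
import zipfile

STEP = re.compile(r"(\w+)\(([^()]*)\)")

def program_to_tokens(program):
    """Turn "subtract(5829, 5735)" into ["subtract(", "5829", "5735", ")", "EOF"]."""
    tokens = []
    for op, arg_text in STEP.findall(program):
        tokens.append(op + "(")
        tokens.extend(arg.strip() for arg in arg_text.split(","))
        tokens.append(")")
    tokens.append("EOF")
    return tokens

def build_submission(predictions, json_name):
    """Return the bytes of a zip archive holding one JSON file at its top level.

    predictions: dict mapping example id -> predicted program string.
    """
    records = [
        {"id": example_id, "predicted": program_to_tokens(program)}
        for example_id, program in predictions.items()
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # writestr stores the file without any enclosing directory.
        archive.writestr(json_name, json.dumps(records, ensure_ascii=False, indent=2))
    return buffer.getvalue()

payload = build_submission(
    {"ETR/2016/page_23.pdf-2": "subtract(5829, 5735)"},
    json_name="predictions.json",
)
# To save the archive: open("results.zip", "wb").write(payload)
```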

Organizers

Nguyen Thi Minh Huyen
Ha My Linh
Vu Xuan Luong
Pham Thi Duc
Ngo The Quyen
Le Ngoc Toan
Phan Thi Hue
Le Van Cuong

Registration

https://forms.gle/uHqzB9V9epvZC45o7

Contact

Zalo group: https://zalo.me/g/etjtme178

Submission

(to be updated)

References

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2368–2378. Association for Computational Linguistics.
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2357–2367. Association for Computational Linguistics.
Zhiyu Chen et al. 2021. FinQA: A dataset of numerical reasoning over financial data. arXiv preprint arXiv:2109.00122.

Association for Vietnamese Language and Speech Processing
