Data and code for ACL 2022 paper "MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data"
- python 3.9.7
- pytorch 1.10.2
- pytorch-lightning 1.5.10
- huggingface transformers 4.18.0
- Run pip install -r requirements.txt to install the rest of the dependencies
- The leaderboard for the private test data is held on CodaLab
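The pinned versions listed above can be compared against the active environment before training. The following is a minimal sketch, not part of the repository; the PyPI distribution names (torch, pytorch-lightning, transformers) and the helper name find_mismatches are assumptions made for illustration.

```python
from importlib import metadata

# Versions copied from the requirements list above. The distribution names
# used as keys are assumptions about how the packages are published on PyPI.
EXPECTED = {
    "torch": "1.10.2",
    "pytorch-lightning": "1.5.10",
    "transformers": "4.18.0",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # A local build tag such as "1.10.2+cu113" still counts as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means the three pinned packages match; the remaining dependencies are still installed from requirements.txt.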
training_configs/ & inference_configs/ # Configuration files for training and inference
models/ # Implementation for each module
datasets/ # Dataloaders
callbacks/ # Callbacks for saving predictions
utils/ # Utilities for modules
txt_files/ # Txt files such as constant_list.txt, etc
output/ # Predictions and intermediate results
checkpoint/ # convert inference of Fact Retrieving & Question Type Classification Module into model input of Reasoning Modules.
The dataset is stored as JSON files (Download Link). Each entry has the following format:
"uid": unique example id;
"paragraphs": the list of sentences in the document;
"tables": the list of tables in HTML format in the document;
"table_description": the list of table descriptions for each data cell in tables, generated by the pre-processing script;
"qa": {
  "question": the question;
  "answer": the answer;
  "program": the reasoning program;
  "text_evidence": the list of indices of gold supporting text facts;
  "table_evidence": the list of indices of gold supporting table facts;
}
We provide the model checkpoints on Hugging Face. Download them (*.ckpt) into the directory checkpoints.
- Edit the training configuration file (training_configs/*_finetuning.yaml) to set your own project and data path.
- Run the following commands to train the model.
export PYTHONPATH=`pwd`; python {fit, validate} --config training_configs/*_finetuning.yaml
- Edit the inference configuration file (inference_configs/*_inference.yaml) to set your own project and data path.
- Run the following commands to get the intermediate results for the {Train, Dev, Test} sets, respectively.
export PYTHONPATH=`pwd`; python predict --ckpt_path checkpoints/*_model.ckpt --config inference_configs/*_inference.yaml
where checkpoints/*_model.ckpt can be replaced by the checkpoint path from the training stage. The inference set or files should be specified in *_inference.yaml.
- This step takes output/retriever_output/{train, dev, test}.json & output/question_classification_output/{train, dev, test}.json from Step 1 as input.
- Run the following commands to convert the predictions of the Fact Retrieving & Question Type Classification Modules for the {Train, Dev, Test} sets into model input for the Reasoning Module, respectively.
The output files are stored in dataset/reasoning_module_input, where *_training.json is used for the training stage and *_inference.json is used for the inference stage.
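Because the conversion depends on the six intermediate files from Step 1, it can help to confirm they exist before running it. The sketch below is an assumption-laden convenience check, not part of the repository; the helper name missing_inputs is hypothetical, and the paths are taken from the file patterns given above.

```python
from pathlib import Path

SPLITS = ("train", "dev", "test")
# Directories named in the Step 1 output patterns above.
INPUT_DIRS = (
    "output/retriever_output",
    "output/question_classification_output",
)


def missing_inputs(root="."):
    """List the Step 1 output files that are not present under root."""
    root = Path(root)
    expected = [root / d / f"{split}.json" for d in INPUT_DIRS for split in SPLITS]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    for path in missing_inputs():
        print(f"missing: {path}")
```

Running it from the project root prints one line per missing file and nothing when all inputs are in place.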
- Edit the training configuration file (training_configs/*_finetuning.yaml) to set your own project and data path.
- Run the following commands to train the model and generate the prediction files.
export PYTHONPATH=`pwd`; python fit --config training_configs/*_finetuning.yaml
- Edit the inference configuration file (inference_configs/*_inference.yaml) to set your own project and data path.
- Run the following commands to get the prediction file for the {Dev, Test} sets.
export PYTHONPATH=`pwd`; python predict --ckpt_path checkpoints/*_model.ckpt --config inference_configs/*_inference.yaml
where checkpoints/*_model.ckpt can be replaced by the checkpoint path from the training stage. The inference set or files should be specified in *_inference.yaml.
Run the following commands to get the prediction file for {Dev, Test} set (and the performance on the Dev set), respectively.
python dataset/{test, dev}.json
The prediction file will be generated in the directory output/final_predictions, with the following format:
{
  "uid": "bd2ce4dbf70d43e094d93d314b30bd39",
  "predicted_ans": "106.0",
  "predicted_program": []
}
For the test set, please zip the generated test prediction file test_predictions.json and submit the archive to CodaLab to get the final score. The filename must match exactly.
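Before submitting, the generated file can be sanity-checked against the prediction format shown earlier. The sketch below assumes the file holds a JSON list of prediction entries, which is not stated explicitly here; the helper name invalid_entries and the second sample entry are hypothetical.

```python
import json

# Keys shown in the documented prediction format.
REQUIRED_KEYS = {"uid", "predicted_ans", "predicted_program"}


def invalid_entries(predictions):
    """Return the positions of entries that lack any of the documented keys."""
    return [
        i
        for i, entry in enumerate(predictions)
        if not REQUIRED_KEYS.issubset(entry)
    ]


def check_file(path):
    """Load a prediction file (assumed to be a JSON list) and validate it."""
    with open(path, encoding="utf-8") as f:
        return invalid_entries(json.load(f))


sample = [
    {
        "uid": "bd2ce4dbf70d43e094d93d314b30bd39",
        "predicted_ans": "106.0",
        "predicted_program": [],
    },
    {"uid": "hypothetical-incomplete-entry"},  # lacks the two prediction keys
]
print(invalid_entries(sample))
```

The check only covers key presence; it does not verify answer values or program syntax.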
For any issues or questions, kindly email Yilun Zhao.
title = "{M}ulti{H}iertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data",
author = "Zhao, Yilun and
Li, Yunxiang and
Li, Chenying and
Zhang, Rui",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "",
pages = "6588--6600",