MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering

Links: HuggingFace · arXiv · Website

Table of Contents

  1. Abstract
  2. Folder Structure
  3. Dataset and Usage
  4. Baseline Results
  5. Contribution and Future Work
  6. Citation and License

1. Abstract

We introduce the Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTBench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTBench enables the formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target a model's ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.

2. Folder Structure

MTBench/
├── data/                            # Downloaded datasets
│   ├── raw/                         # Text-only or time-series-only datasets
│   └── processed/                   # Task-specific datasets
├── data_preparation/                # Dataset preparation scripts
│   ├── weather/                     # Scripts for weather data processing
│   └── finance/                     # Scripts for financial data processing
├── evaluation/                      # Evaluation scripts for benchmarking
│   ├── weather/                     # Evaluation scripts for weather data
│   ├── finance/                     # Evaluation scripts for finance data
│   └── api_call.py                  # Functions for calling LLM APIs
├── requirements.txt                 # Dependencies
├── download_raw_dataset.py          # Download the raw dataset
├── download_processed_dataset.py    # Download all processed datasets
└── README.md                        # Project documentation

3. Dataset and Usage

MTBench provides datasets in two domains: weather and finance. These datasets are designed to evaluate large language models (LLMs) on temporal reasoning and question-answering tasks. Each dataset consists of structured time-series data and textual questions that require an understanding of time-dependent trends.

Dependencies

Run the following commands to clone the repository and create a conda environment for MTBench:

git clone https://github.com/Graph-and-Geometric-Learning/MTBench.git
cd MTBench

conda create -n MTBench python=3.10.14
conda activate MTBench
pip install -r requirements.txt

Download Dataset

Download the raw dataset by running python download_raw_dataset.py. We provide scripts to preprocess the raw data in data_preparation. These scripts will:

  • generate trend labels and calculate technical indicators (see the sketch below)
  • generate multiple-choice QA samples and correlation labels
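
For intuition, here is a minimal Python sketch of the first step. It is illustrative only: the up/down/flat label scheme, the 1% threshold, and the function names are assumptions, not the actual implementation in data_preparation/.

# Illustrative sketch; label scheme, threshold, and names are assumptions,
# not the repo's actual preprocessing code.
import numpy as np

def trend_label(prices: np.ndarray, flat_band: float = 0.01) -> str:
    """Label a price window as up/down/flat from its relative change."""
    change = (prices[-1] - prices[0]) / prices[0]
    if change > flat_band:
        return "up"
    if change < -flat_band:
        return "down"
    return "flat"

def sma(prices: np.ndarray, window: int = 5) -> np.ndarray:
    """Simple moving average, one common technical indicator."""
    return np.convolve(prices, np.ones(window) / window, mode="valid")

prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1, 104.0, 103.5])
print(trend_label(prices))    # "up" (+3.5% over the window)
print(sma(prices, window=3))  # five smoothed values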

For your convenience, you can also download all of the processed data by running:

python download_processed_dataset.py
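
Once downloaded, a processed sample can be inspected directly. A hedged sketch, assuming one JSON file per sample with question/choices/answer fields; the actual file names and schema are defined by the preprocessing scripts:

import json
from pathlib import Path

# The folder name follows the evaluation example below; the file name and
# field names here are assumptions, not the repo's actual schema.
sample_path = Path("data/processed/finance/aligned_in30days_out7days/sample.json")
with open(sample_path) as f:
    sample = json.load(f)

print(sample["question"])  # textual question paired with the series
print(sample["choices"])   # multiple-choice options
print(sample["answer"])    # gold label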

Dataset Distribution

Distributions of financial news impact duration and financial news categories:

[Figure: Finance Duration Distribution | Finance Report Type Distribution]

Distributions of severe weather duration and their types:

[Figure: Weather Duration Distribution | Weather Event Distribution]

Evaluation

To evaluate models on MTBench, you need to:

  1. Set up API keys for LLMs in evaluation/api_call.py (a minimal wrapper sketch follows this list)
  2. Choose the domain, evaluation task, and setting
  3. Run the corresponding evaluation script
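
For step 1, evaluation/api_call.py centralizes the LLM calls. Below is a minimal sketch of what such a wrapper can look like, shown with the official OpenAI client only as an assumption; the actual function names and supported providers live in that file.

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # or set the OPENAI_API_KEY env var

def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single-turn prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content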

For example, to evaluate time series trend classification on financial data, set the arguments in evaluation/finance/run_trend_classification.sh:

API_NAME="gpt-4o"  # choose the LLM to be evaluated
MODE="combined"    # choose the input type, select from ["timeseries_only", "combined"]
IN_DAYS=30         # length of input time series
OUT_DAYS=7         # length of output time series

python trend_classification.py \
    --dataset_folder="../../data/processed/finance/aligned_in${IN_DAYS}days_out${OUT_DAYS}days" \
    --save_path="../../results/finance/trend_classification_in${IN_DAYS}_out${OUT_DAYS}/${API_NAME}_${MODE}" \
    --model=$API_NAME \
    --mode=$MODE

Then run the evaluation script:

  cd evaluation/finance
  bash run_trend_classification.sh

Results are saved under results/finance/trend_classification_in${IN_DAYS}_out${OUT_DAYS}/ accordingly.
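
To aggregate saved outputs yourself, something like the following works; note that the results file name and its fields here are assumptions, so consult the evaluation scripts for the real output format:

import json
from pathlib import Path

# Path follows the save_path pattern above; the "results.jsonl" name and the
# "prediction"/"label" fields are assumptions, not the actual output format.
run_dir = Path("results/finance/trend_classification_in30_out7/gpt-4o_combined")
with open(run_dir / "results.jsonl") as f:
    records = [json.loads(line) for line in f]

accuracy = sum(r["prediction"] == r["label"] for r in records) / len(records)
print(f"Trend classification accuracy: {accuracy:.3f}")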

4. Baseline Results

We benchmark several state-of-the-art LLMs. The performance varies across different temporal reasoning tasks, highlighting areas for improvement in existing LLMs.

Results on Finance Data

Evaluation on short-term finance data (e.g., 7-day input, 1-day output). "➡️" indicates the performance change from Time Series-Only to Time Series + Text input.

| Model    | Trend Prediction (ACC) | Technical Indicator (MSE) | Correlation (ACC) | MCQA (ACC) |
|----------|------------------------|---------------------------|-------------------|------------|
| GPT-4o   | 40.93 ➡️ 42.81         | 0.430 ➡️ 0.365            | 53.6              | 65.1       |
| Gemini   | 41.30 ➡️ 47.30         | 0.482 ➡️ 0.384            | 51.8              | 63.6       |
| Claude   | 41.20 ➡️ 44.90         | 0.241 ➡️ 0.373            | 50.4              | 75.6       |
| DeepSeek | 40.53 ➡️ 45.12         | 0.435 ➡️ 0.352            | 50.0              | 77.6       |

Results on Weather Data

| Model    | Temperature Forecasting (MSE) | Trend Prediction (ACC) | Temperature Difference (MSE) | MCQA (ACC) |
|----------|-------------------------------|------------------------|------------------------------|------------|
| GPT-4o   | 21.67 ➡️ 17.55                | 23.07 ➡️ 43.54         | 27.06 ➡️ 18.84               | 41.7       |
| Gemini   | 25.75 ➡️ 24.31                | 17.91 ➡️ 51.76         | 35.72 ➡️ 23.21               | 43.4       |
| Claude   | 30.34 ➡️ 22.48                | 33.23 ➡️ 56.87         | 21.03 ➡️ 19.10               | 51.8       |
| DeepSeek | 31.02 ➡️ 29.38                | 16.89 ➡️ 25.17         | 49.28 ➡️ 44.99               | 46.7       |

5. Contribution and Future Work

We invite contributions to improve MTBench, including:

  • Expanding dataset diversity with new domains.
  • Enhancing task formulation for more complex temporal reasoning.
  • Developing evaluation metrics tailored for multimodal time series reasoning.
  • Designing novel and effective architectures and algorithms for multimodal time series reasoning.

6. Citation and License

This code repository is licensed under the MIT License.

If you find MTBench useful, please consider citing our paper:

@article{chen2025mtbench,
  title={MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering},
  author={Chen, Jialin and Feng, Aosong and Zhao, Ziyu and Garza, Juan and Nurbek, Gaukhar and Qin, Cheng and Maatouk, Ali and Tassiulas, Leandros and Gao, Yifeng and Ying, Rex},
  journal={arXiv preprint arXiv:2503.16858},
  year={2025}
}
