Extend Chronos evaluation to all 28 datasets from the paper #281

Closed
30 changes: 30 additions & 0 deletions experiments/amazon-chronos/.dockerignore
@@ -0,0 +1,30 @@
.metaflow
.git

.ipynb_checkpoints
logs
results
**/logs

# editors/IDEs
.idea
*.iml
**/*.swp
**/*.ipynb

# tmp files
**/*.log
**/*.csv
**/*.arff

# binary files
**/*.whl
**/*.jar
**/*.zip
**/*.tar
**/*.tgz
**/*.bz2
**/*.png
**/*.sif
**/__pycache__
**/*.egg-info
10 changes: 10 additions & 0 deletions experiments/amazon-chronos/Dockerfile
@@ -0,0 +1,10 @@
FROM pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime

RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y git

ADD . /app/
WORKDIR /app

RUN python -m pip install "."
83 changes: 57 additions & 26 deletions experiments/amazon-chronos/README.md
@@ -1,48 +1,79 @@
# Extended comparison of Chronos against the statistical ensemble

We present an extension of the [original comparison by Nixtla](https://github.com/Nixtla/nixtla/tree/main/experiments/amazon-chronos) of Chronos [1] against the SCUM ensemble [2]. In this analysis, covering over 200K unique time series across the 28 datasets from Benchmark II in the Chronos paper [1], we show that **zero-shot** Chronos models perform comparably to this strong ensemble of four statistical models while being significantly faster on average. We follow the original study as closely as possible, including loading task definitions from GluonTS and computing metrics with utilsforecast.
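
The metric computation follows the pattern sketched below (our illustration with toy data, not the repository's code; in the real pipeline the series, frequency, and horizon come from the GluonTS task definitions):
```python
from functools import partial

import numpy as np
import pandas as pd
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mase, smape

# Toy stand-ins for a single task with hourly data and a 24-step horizon.
rng = np.random.default_rng(0)
horizon, season_length = 24, 24
ds_train = pd.date_range("2020-01-01", periods=10 * season_length, freq="h")
ds_test = pd.date_range(ds_train[-1], periods=horizon + 1, freq="h")[1:]
train_df = pd.DataFrame(
    {"unique_id": "series_1", "ds": ds_train, "y": rng.normal(size=ds_train.size)}
)
forecasts_df = pd.DataFrame(
    {
        "unique_id": "series_1",
        "ds": ds_test,
        "y": rng.normal(size=horizon),  # held-out actuals
        "chronos_large": rng.normal(size=horizon),  # a model's point forecasts
    }
)

results = evaluate(
    forecasts_df,
    metrics=[smape, partial(mase, seasonality=season_length)],
    train_df=train_df,  # MASE scales errors by in-sample seasonal-naive errors
)
print(results)
```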

## Empirical Evaluation

This study considers over 200K unique time series from Benchmark II in the Chronos paper, spanning various time series domains, frequencies, history lengths, and prediction horizons. Chronos did not use these datasets in its training phase, so this is a **zero-shot** evaluation of Chronos against the statistical ensemble fitted on these datasets. We report results for two sizes of Chronos, Large and Mini, to highlight the trade-off between forecast quality and inference speed. As in the [original benchmark](https://github.com/Nixtla/nixtla/tree/main/experiments/amazon-chronos), we include comparisons to the seasonal naive baseline. For each model, we also report the aggregated relative score, i.e., the geometric mean across datasets of the model's score relative to seasonal naive (see Sec. 5.4 of [1] for details).
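
Concretely, the aggregated relative score corresponds to the following sketch (our illustration, not the benchmark's code):
```python
import numpy as np

def aggregated_relative_score(model_scores, baseline_scores):
    """Geometric mean, across datasets, of a model's metric relative to
    seasonal naive; values below 1.0 mean the model beats the baseline."""
    ratios = np.asarray(model_scores) / np.asarray(baseline_scores)
    return float(np.exp(np.log(ratios).mean()))

# e.g., 20% better on one dataset and 10% worse on another:
print(aggregated_relative_score([0.8, 1.1], [1.0, 1.0]))  # ≈ 0.938
```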

## Results

The CRPS, MASE, sMAPE, and inference time (in seconds) for each model across the 28 datasets are tabulated below. The best and second-best results are highlighted in **bold** and <u>underlined</u>, respectively. Note that the use of sMAPE is [discouraged by forecasting experts](https://otexts.com/fpp3/accuracy.html#percentage-errors); we report it here only for completeness and parity with the previous benchmark.

<center>
<img width="1099" alt="image" src="./full_benchmark_results.png">
</center>

### Notes
- The original study by Nixtla used `batch_size=8` for all Chronos models. However, on the `g5.2xlarge` instance used in this benchmark, we can safely use a batch size of 16 for Chronos (large) and a batch size of 64 for Chronos (mini); see the sketch after this list.
- The original Nixtla benchmark re-used compiled Numba code across experiments, which is not feasible in the current setup because of the distributed compute environment. Therefore, the reported runtime for `StatisticalEnsemble` is on average ~45 seconds higher than in the original benchmark. This does not affect the overall conclusions or the runtime ranking of `StatisticalEnsemble` and the Chronos models.
- Due to differences in task definitions and metric implementations, the numbers in the above table are not directly comparable with the results reported in the Chronos paper.
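
For illustration, here is a minimal sketch of what these batch sizes mean in practice, using the public `chronos-forecasting` API (the model name, context tensors, and horizon below are placeholders, not the benchmark's actual configuration):
```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-mini",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Placeholder histories; in the benchmark these are the per-series contexts.
contexts = [torch.randn(512) for _ in range(1000)]

batch_size = 64  # 16 for chronos-t5-large on a g5.2xlarge (A10G, 24 GiB)
forecasts = []
for start in range(0, len(contexts), batch_size):
    batch = contexts[start : start + batch_size]
    # predict() returns sample paths of shape [len(batch), num_samples, horizon]
    forecasts.append(pipeline.predict(batch, prediction_length=24))
samples = torch.cat(forecasts)
```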

## Reproducibility

### Installation
Create a virtual environment and install the dependencies:

```bash
conda create -n chronos python=3.10
conda activate chronos
pip install -e .
```

### (Option 1) Running experiments locally
To evaluate a model sequentially on all 28 datasets considered in the benchmark, run the following command:
```bash
python src/run_metaflow.py run --model=$MODEL_NAME --max-workers=1
```
where `$MODEL_NAME` can be one of `SeasonalNaive`, `StatisticalEnsemble`, `chronos_mini`, `chronos_small`, `chronos_base`, `chronos_large`.

We set `--max-workers=1` to ensure that each dataset is evaluated sequentially and the runtime is measured correctly.
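
For example, to run the zero-shot Chronos (mini) model over all datasets:
```bash
python src/run_metaflow.py run --model=chronos_mini --max-workers=1
```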

Note that `StatisticalEnsemble` can take multiple hours to produce forecasts for datasets with long time series and a large `season_length` (e.g., ETT or ERCOT). Similarly, `chronos_large` takes a while to forecast for datasets with many individual time series (e.g., Dominick or M5). Therefore, the full loop over all datasets can take more than a day.

### (Option 2) Running experiments in parallel using Metaflow
**Note that running experiments in the cloud will incur costs.**

1. Configure Metaflow for parallel execution of jobs in the cloud. For example, this can be done by deploying the [Metaflow CloudFormation stack](https://github.com/Netflix/metaflow-tools/tree/master/aws/cloudformation) and [providing configuration details in `~/.metaflowconfig/config.json`](https://outerbounds.com/engineering/operations/configure-metaflow/).
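
For illustration, a minimal `~/.metaflowconfig/config.json` might look as follows (key names follow the standard Metaflow AWS setup; all values are placeholders for your own resources, and only the last two keys are read by `build_docker.sh`):
```bash
cat > ~/.metaflowconfig/config.json <<'EOF'
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<your-bucket>/metaflow",
    "METAFLOW_BATCH_JOB_QUEUE": "<your-aws-batch-job-queue>",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "<your-batch-execution-role>",
    "METAFLOW_BATCH_CONTAINER_REGISTRY": "<account-id>.dkr.ecr.<region>.amazonaws.com",
    "METAFLOW_BATCH_CONTAINER_IMAGE": "<your-ecr-repository>"
}
EOF
```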

2. Uncomment lines 83-88 in `src/run_metaflow.py` to enable parallel execution on AWS Batch.

3. Build the Docker image used for the experiments:

```bash
bash build_docker.sh
```

4. Run the experiments using Metaflow. We use the same hardware configuration as the original Nixtla benchmark.

For Chronos models, we use `g5.2xlarge` instances with a single A10G GPU:
```bash
python src/run_metaflow.py run --model="chronos_mini" --max-workers=28
```

For `StatisticalEnsemble`, we use `c5a.24xlarge` instances with 96 vCPU cores:
```bash
BATCH_NUM_GPUS=0 BATCH_NUM_CPUS=96 BATCH_MEMORY_MB=190000 python src/run_metaflow.py run --model=StatisticalEnsemble --max-workers=28
```

Make sure that the compute environment associated with your AWS Batch job queue includes the respective instance types.
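
One way to verify this (our suggestion, assuming a configured AWS CLI) is to list the instance types allowed by your compute environments:
```bash
aws batch describe-compute-environments \
    --query "computeEnvironments[].[computeEnvironmentName, computeResources.instanceTypes]"
```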

You can adjust `--max-workers` to change the number of instances running the experiments in parallel.

### Collecting the results
Run the notebook `collect_results.ipynb` to collect the results from Metaflow and compile the final results table.
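
If you prefer to inspect runs programmatically, the Metaflow client API can be used along these lines (the flow, step, and artifact names below are placeholders; the actual ones are defined in `src/run_metaflow.py` and used by the notebook):
```python
from metaflow import Flow

# Placeholder names: check src/run_metaflow.py for the real flow/step/artifact.
run = Flow("ChronosEvalFlow").latest_successful_run
results = [task.data.results for task in run["evaluate"]]  # one entry per dataset
```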

### References

[1] [Chronos: Learning the Language of Time Series](https://arxiv.org/abs/2403.07815)
[2] [A Simple Combination of Univariate Models](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300585?via%3Dihub)
48 changes: 48 additions & 0 deletions experiments/amazon-chronos/README_OLD.md
@@ -0,0 +1,48 @@
# Amazon Chronos is 10% less accurate and 500% slower than training classical statistical models.

We present a fully reproducible comprehensive evaluation showcasing that a Statistical Ensemble, consisting of AutoARIMA, AutoETS, AutoCES, and DynamicOptimizedTheta, outperforms Amazon Chronos—a foundational model for time series forecasting with over 710 million parameters. Specifically, the **Statistical Ensemble demonstrates 10%, 10%, and 11% superior performance in CRPS, MASE, and SMAPE metrics, respectively**, and it is **5x faster**. This analysis spans over **50,000 unique time series** across M1, M3, M4, and Tourism datasets, robustly comparing these models.

# Introduction

The rise of foundational models in time series forecasting, such as Amazon Chronos, represents a significant leap forward, leveraging deep learning and massive datasets for model pre-training to enhance predictive accuracy. Amazon Chronos, in particular, is noteworthy for its extensive parameterization and ambitious scope. However, our study shows that a comparatively simpler approach, employing a Statistical Ensemble of traditional forecasting methods, yields better accuracy and computational efficiency. One year ago, we used the same [benchmark](https://github.com/Nixtla/statsforecast/tree/main/experiments/m3) to showcase how statistical models outperformed deep learning models.

## Empirical Evaluation

This study considers over 50,000 unique time series from the M1, M3, M4, and Tourism datasets, spanning various time series frequencies. Chronos did not use these datasets in the training phase. We have also included comparisons to the Seasonal Naive model to provide a benchmark for traditional forecasting methods.

## Results

Our findings are shown in the following table, showcasing the performance across different metrics: CRPS, MASE, SMAPE, and computational time (in seconds). The best results are highlighted in **bold** for ease of reference.

<img width="1099" alt="image" src="https://github.com/Nixtla/nixtla/assets/10517170/4d4fe9f3-4251-4b95-bd9b-248fc283e97b">


## Reproducibility

To ensure the reproducibility of our findings, the Statistical Ensemble experiments were conducted on an AWS c5a.24xlarge instance, equipped with 96 vCPUs and 192 GiB of RAM. In contrast, the experiments for Amazon Chronos were carried out on an AWS g5.4xlarge GPU instance, which includes 16 vCPUs, 64 GiB of RAM, and an NVIDIA A10G Tensor Core GPU with 24 GiB. All necessary code and detailed instructions for reproducing the experiments are available in this directory.

### Instructions

1. Set up a Python environment:

```bash
mamba env create -f environment.yml
conda activate amazon-chronos
```

2. Run the experiments as reported in the table:

```bash
python -m src.main --mode fcst_statsforecast
python -m src.main --mode fcst_chronos
```

3. Evaluate the results using:

```bash
python -m src.main --mode evaluation
```

### References
- **Statistical Ensemble Paper**: [A Simple Combination of Univariate Models](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300585?via%3Dihub)
- **Amazon Chronos Paper**: [Chronos: Learning the Language of Time Series](https://arxiv.org/abs/2403.07815)
11 changes: 11 additions & 0 deletions experiments/amazon-chronos/build_docker.sh
@@ -0,0 +1,11 @@
#!/bin/bash
set -euxo pipefail

TAG=${1:-"nixtla-eval"}

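# Read the ECR registry and image name from the Metaflow configuration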
export METAFLOW_BATCH_CONTAINER_REGISTRY=$(cat ~/.metaflowconfig/config.json | jq -r .METAFLOW_BATCH_CONTAINER_REGISTRY)
export METAFLOW_BATCH_CONTAINER_IMAGE=$(cat ~/.metaflowconfig/config.json | jq -r .METAFLOW_BATCH_CONTAINER_IMAGE)

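# Build the image, authenticate to ECR, and push it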
docker build -t $METAFLOW_BATCH_CONTAINER_REGISTRY/$METAFLOW_BATCH_CONTAINER_IMAGE:$TAG .
aws ecr get-login-password --region $(aws configure get region) | docker login --username AWS --password-stdin $METAFLOW_BATCH_CONTAINER_REGISTRY
docker push $METAFLOW_BATCH_CONTAINER_REGISTRY/$METAFLOW_BATCH_CONTAINER_IMAGE:$TAG