VQShape

VQShape is a pre-trained model for time-series analysis. It provides shape-level features as interpretable representations of time-series data and contains a universal codebook of shapes that generalizes across domains and datasets. The current checkpoints are pre-trained on the UEA multivariate time-series classification datasets [Bagnall et al., 2018].

For more details, please refer to our paper:
Abstracted Shapes as Tokens - A Generalizable and Interpretable Model for Time-series Classification
(NeurIPS 2024)
Authors: Yunshi Wen, Tengfei Ma, Tsui-Wei Weng, Lam M. Nguyen, Anak Agung Julius

Note

This repository is under construction while we clean up the source code. We aim to update it every one or two weeks. Future updates will include benchmarking scripts for forecasting and imputation.

Usage

Environment Setup:

Python version: 3.11.

We use PyTorch Lightning to manage training, checkpointing, and potential future distributed training.

Install the dependencies with conda:

conda env create -f environment.yml

Install the dependencies with pip:

conda create -n vqshape python=3.11
conda activate vqshape
pip install -r requirements.txt

Data Preparation

Because of the memory limitations of our computational resources, we currently implement a lazy-loading mechanism for pre-training. To prepare the pre-training data, we first read the UEA multivariate time series and save each univariate time series into a CSV file. Refer to this notebook for more details. The dataset implementation can be replaced to fit specific computational resources; a rough sketch of the conversion is shown below.
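
As a rough illustration only, the snippet below sketches how a multivariate dataset could be split into per-channel CSV files. The array shape, output directory, and file-naming scheme are assumptions for illustration and do not necessarily match the notebook.

import os
import numpy as np
import pandas as pd

# Assume X is a pre-loaded UEA dataset with shape (n_samples, n_channels, n_timesteps).
X = np.random.randn(100, 5, 1000)  # placeholder data for illustration

out_dir = "data/pretrain_csv"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

# Save each univariate (single-channel) series into its own CSV file for lazy loading.
for i, sample in enumerate(X):
    for c, channel in enumerate(sample):
        pd.DataFrame({"value": channel}).to_csv(
            os.path.join(out_dir, f"sample{i}_channel{c}.csv"), index=False
        )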

Pre-training

Specify the codebook size and the embedding dimension. For example, for a codebook size of 64 and an embedding dimension of 512, run:

bash ./scripts/pretrain.sh 64 512

Other hyperparameters and configurations can be specified in the bash script.

Load the pre-trained checkpoint

The pre-trained checkpoint can be loaded through the PyTorch Lightning module. Here is an example of loading a checkpoint onto a CUDA device:

from vqshape.pretrain import LitVQShape

checkpoint_path = "checkpoints/uea_dim512_codebook64/VQShape.ckpt"
lit_model = LitVQShape.load_from_checkpoint(checkpoint_path, 'cuda')  # second argument is the map_location (target device)
model = lit_model.model

Use the pre-trained model

1. Tokenization (extract shapes from time-series)

import torch
import torch.nn.functional as F
from einops import rearrange
from vqshape.pretrain import LitVQShape

# load the pre-trained model
checkpoint_path = "checkpoints/uea_dim512_codebook64/VQShape.ckpt"
lit_model = LitVQShape.load_from_checkpoint(checkpoint_path, 'cuda')
model = lit_model.model

x = torch.randn(16, 5, 1000)  # 16 multivariate time-series, each with 5 channels and 1000 timesteps
x = F.interpolate(x, 512, mode='linear')  # first interpolate to 512 timesteps
x = rearrange(x, 'b c t -> (b c) t')  # transform to univariate time-series

representations, output_dict = model(x, mode='tokenize') # tokenize with VQShape

token_representations = representations['token']
histogram_representations = representations['histogram']
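
Continuing the example above (so x, rearrange, and histogram_representations are already defined), the representations are returned per univariate series and can be regrouped per sample for downstream use. This is a minimal sketch; it assumes the histogram representation is a fixed-length vector of dimension d per univariate series, which should be checked against the actual output shapes.

# Regroup per-channel representations back into per-sample features (shapes are assumptions).
batch_size, num_channels = 16, 5

hist = rearrange(histogram_representations, '(b c) d -> b c d', b=batch_size, c=num_channels)
# One simple per-sample feature vector: concatenate the channel histograms.
sample_features = rearrange(hist, 'b c d -> b (c d)')
print(sample_features.shape)  # expected: (16, 5 * d)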

2. Classification

Specify the codebook size and the embedding dimension. To benchmark on the UEA datasets with a codebook size of 64 and an embedding dimension of 512, run:

bash ./scripts/mtsc.sh 64 512
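
The script above runs the full UEA benchmark. For a quick, non-canonical experiment, one could also fit a simple classifier directly on the extracted per-sample features; the sketch below is only an illustration, using placeholder features and labels and scikit-learn as an example classifier (not part of the provided benchmark script).

# Hypothetical example: logistic regression on VQShape histogram features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

features = np.random.randn(200, 320)  # placeholder for per-sample VQShape features
y = np.random.randint(0, 4, size=200)  # placeholder class labels

X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))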

3. Forecasting

(Coming soon.)

4. Imputation

(Coming soon.)

Pre-trained Checkpoints

We provide pre-trained checkpoints on the UEA time-series classification datasets, which reproduce the results reported in the paper.

The checkpoints are available at the release page. To use the checkpoints:

  1. Download the .zip file.
  2. Unzip the file into ./checkpoints.

Information about the pre-trained checkpoints:

| Embedding Dimension | Codebook Size | Num. Parameters | Checkpoint | Mean Accuracy (Token) | Mean Accuracy (Hist.) |
| --- | --- | --- | --- | --- | --- |
| 256 | 512 | 9.5M | download | 0.731 | 0.711 |
| 512 | 512 | 37.1M | download | 0.723 | 0.709 |
| 512 | 128 | 37.1M | download | 0.720 | 0.711 |
| 512 | 64 | 37.1M | download | 0.719 | 0.712 |
| 512 | 32 | 37.1M | download | 0.706 | 0.696 |

Note

The checkpoint with embedding dimension 256 and codebook size 512 is not included in the paper, but it achieves the best performance. However, a checkpoint with embedding dimension 256 and codebook size 64 performs worse. We believe the scaling relationship between embedding dimension and codebook size needs further investigation.

Citation

@inproceedings{
    wen2024abstracted,
    title={Abstracted Shapes as Tokens - A Generalizable and Interpretable Model for Time-series Classification},
    author={Yunshi Wen and Tengfei Ma and Tsui-Wei Weng and Lam M. Nguyen and Anak Agung Julius},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024}
}

Acknowledgement

We thank the research community for their great work on time-series analysis, the open-source codebases, and the datasets, including but not limited to: