This GitHub repository contains code related to the paper "A Large-Scale Annotated Multivariate Time Series Aviation Maintenance Dataset from the NGAFID".
There are two notebooks for reproducing experiments and one example notebook for viewing all flight data using dask dataframes.
The repository notebooks automatically download files hosted on Google Drive to run the benchmark experiments.
The full dataset can be downloaded from https://doi.org/10.5281/zenodo.6624956 or https://www.kaggle.com/datasets/hooong/aviation-maintenance-dataset-from-the-ngafid.
To explore the full dataset, run the Dask example notebook in the repository: https://github.com/hyang0129/NGAFIDDATASET/blob/main/NGAFID_DATASET_DASK_EXAMPLE.ipynb.
In terms of data structure, the flight header dataframe uses the master index column to link to the index of the flight data dask dataframe.
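The link between the two tables can be illustrated with a small toy example in plain pandas. This is only a sketch: the `label` and `sensor_a` columns and the helper `rows_for_flights` are hypothetical stand-ins, not names taken from the dataset, and the real flight data is a Dask dataframe rather than a pandas one.

```python
import pandas as pd

# Toy stand-in for the flight header: one row per flight, indexed by Master Index.
# The "label" column is hypothetical and only used for illustration.
flight_header = pd.DataFrame(
    {"label": ["before", "after"]},
    index=pd.Index([101, 102], name="Master Index"),
)

# Toy stand-in for the flight data: many rows per flight.
# The index repeats the Master Index value of the flight each row belongs to.
flight_data = pd.DataFrame(
    {"timestep": [0, 1, 2, 0, 1], "sensor_a": [1.0, 1.1, 1.2, 5.0, 5.1]},
    index=[101, 101, 101, 102, 102],
)


def rows_for_flights(data, master_indices):
    """Return every flight data row whose index is one of the given Master Index values."""
    return data[data.index.isin(master_indices)]


# Pick flights from the header, then fetch their rows from the flight data.
selected_ids = flight_header.index[flight_header["label"] == "after"]
selected_rows = rows_for_flights(flight_data, selected_ids)
```

Here `selected_rows` holds only the rows belonging to the flights chosen in the header, which is the same lookup pattern the real header and flight data support.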
If you wish to use this for machine learning, it is best to extract the relevant flights into a format that works with your framework. The benchmark experiments use an extracted version stored as numpy arrays.
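One possible way to turn extracted flights into a numpy array is sketched below. It is not the format used by the benchmark notebooks: the helper `flights_to_array`, the zero padding, and the choice to keep the last `max_len` timesteps of long flights are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def flights_to_array(flight_frames, max_len, pad_value=0.0):
    """Stack per-flight frames into an array of shape (n_flights, max_len, n_channels).

    Each frame is assumed to be already sorted by timestep and to contain only
    numeric sensor columns. Short flights are padded at the end with pad_value;
    long flights keep their last max_len rows (an arbitrary choice for this sketch).
    """
    n_channels = flight_frames[0].shape[1]
    out = np.full((len(flight_frames), max_len, n_channels), pad_value, dtype=np.float32)
    for i, frame in enumerate(flight_frames):
        values = frame.to_numpy(dtype=np.float32)[-max_len:]
        out[i, : len(values)] = values
    return out


# Two toy flights with different lengths and two hypothetical sensor channels.
short_flight = pd.DataFrame({"sensor_a": [1.0, 2.0, 3.0], "sensor_b": [0.1, 0.2, 0.3]})
long_flight = pd.DataFrame({"sensor_a": np.arange(5.0), "sensor_b": np.arange(5.0) * 10})

batch = flights_to_array([short_flight, long_flight], max_len=4)
```

The resulting `batch` has a fixed shape, so it can be saved with `np.save` or passed to most machine learning frameworks directly.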
There are two setups for the benchmark experiments.
To train InceptionTime or ConvMHSA models, run the notebook https://github.com/hyang0129/NGAFIDDATASET/blob/main/NGAFID_DATASET_TF_EXAMPLE.ipynb.
To train MiniRocket, run the notebook https://github.com/hyang0129/NGAFIDDATASET/blob/main/NGAFID_DATASET_MINIROCKET_EXAMPLE.ipynb.
To work with the full dataset locally, download it from https://doi.org/10.5281/zenodo.6624956 or https://www.kaggle.com/datasets/hooong/aviation-maintenance-dataset-from-the-ngafid.
Extract the all_flights.tar.gz archive, then use Python to load the Dask dataframe and the flight header.
import dask.dataframe as dd
import pandas as pd
flight_data_df = dd.read_parquet('all_flights/one_parq')
flight_header_df = pd.read_csv('all_flights/flight_header.csv', index_col='Master Index')
The flight header df has the following columns.
The flight data df has the columns listed below, plus a timestep column that gives the order of the rows within a flight; each flight is identified by the index. Timesteps run from low to high (0 is the first timestep).
Below is a sample of data from the flight data df. Because of Dask partitioning, ordering is only guaranteed along the Index column, not along the timestep column. Sort by timestep when you extract a flight.
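The sorting step can be sketched as follows. The helper `extract_flight` is hypothetical, and the column name `timestep` is assumed from the description above; adjust it if the actual column is named differently. The demonstration uses a small pandas frame with a sorted index as a stand-in, and the `compute()` branch is intended for the case where a Dask dataframe is passed instead.

```python
import pandas as pd


def extract_flight(flight_data, master_index, timestep_col="timestep"):
    """Return all rows of one flight, ordered by timestep.

    A label slice is used so the result is always a dataframe, even when the
    flight has a single row. This relies on the index being sorted.
    """
    rows = flight_data.loc[master_index:master_index]
    # Dask dataframes are lazy; materialize the selected rows as pandas first.
    if hasattr(rows, "compute"):
        rows = rows.compute()
    return rows.sort_values(timestep_col, kind="stable")


# Toy stand-in: rows are grouped by flight id but shuffled within a flight.
toy = pd.DataFrame(
    {"timestep": [2, 0, 1, 1, 0], "sensor_a": [0.3, 0.1, 0.2, 9.2, 9.1]},
    index=[7, 7, 7, 8, 8],
)
flight_7 = extract_flight(toy, 7)
```

With the real data, the call would look like `extract_flight(flight_data_df, some_master_index)`, where `some_master_index` is a value taken from the index of `flight_header_df`.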